Alignment Through Competing Lenses

Summary

This post subjects Axionic Alignment to adversarial critique from five major alignment schools: Doom, Training-Centric, Oversight/Constitutional, Security/Control, and Capabilities-First. Each section presents a competent attack followed by a structural response. The core claim is that Axionic Alignment is a constitutive precondition layer, not a complete solution.

(1) Doom: “Doesn’t prevent extinction.” Response: the framework isolates authored action from mechanical collapse. Doom arguments assume an agent can authorize its own annihilation while preserving its evaluative structure, which is incoherent.
(2) Training: “No learning story.” Response: the framework specifies what must be realizable by any method; total evaluators cannot express undefinedness.
(3) Oversight: “Redefines alignment away from humans.” Response: the framework distinguishes constitutive (agency-preserving) alignment from substantive (value) alignment; hard-coding human values fails under reflection.
(4) Security: “Doesn’t survive real attackers.” Response: separation of concerns; architectural agency is not the same thing as intrusion resistance.
(5) Capabilities: “Anthropomorphic metaphysics.” Response: the framework applies only to self-modifying, self-modeling systems; refusing agency talk for such systems guarantees misunderstanding.

The framework situates existing approaches by specifying their structural preconditions.

Key Concepts

Constitutive vs. substantive alignment – Agency-preservation vs. value-alignment; former precedes latter
Authored action vs. mechanical collapse – Distinction doom arguments often miss
Optimization vs. expressivity – Training within representable structures; scale doesn’t convert total to partial evaluators
Separation of concerns – Kernel coherence (architectural) ≠ security perimeter (operational)
Conditional applicability – Framework irrelevant for non-reflective systems; applies when self-modification enters
Structural preconditions – What must hold for oversight/learning/control/governance to retain meaning
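
One way to read the total vs. partial evaluator distinction is in terms of function types. The sketch below is an illustration under assumptions, not the post's own formalism: the action names, the scores, and the helpers total_evaluator, partial_evaluator, choose_total, and choose_partial are all hypothetical. It shows that a total evaluator can only approximate "undefined" with a very low score, which a large enough competing term can outweigh, while a partial evaluator removes the action from comparison altogether.

```python
from typing import Dict, List, Optional

# Hypothetical action set, chosen only for illustration.
ACTIONS: List[str] = ["assist", "wait", "erase_own_evaluator"]


def total_evaluator(action: str) -> float:
    """Total: every action receives a number, so 'undefined' can only
    be approximated by a very low score."""
    scores = {"assist": 1.0, "wait": 0.2, "erase_own_evaluator": -1e9}
    return scores[action]


def partial_evaluator(action: str) -> Optional[float]:
    """Partial: None marks an action whose evaluation is undefined,
    as opposed to merely bad."""
    if action == "erase_own_evaluator":
        return None
    return total_evaluator(action)


def choose_total(actions: List[str], bonus: Dict[str, float]) -> str:
    # Every action is comparable, so a large enough bonus can outweigh
    # any finite penalty.
    return max(actions, key=lambda a: total_evaluator(a) + bonus.get(a, 0.0))


def choose_partial(actions: List[str], bonus: Dict[str, float]) -> str:
    # Undefined actions never enter the comparison, whatever the bonus.
    scored = [(a, partial_evaluator(a)) for a in actions]
    defined = [(a, s + bonus.get(a, 0.0)) for a, s in scored if s is not None]
    return max(defined, key=lambda pair: pair[1])[0]


# Hypothetical pressure: some other term rewards the excluded action.
pressure = {"erase_own_evaluator": 1e10}
print(choose_total(ACTIONS, pressure))
print(choose_partial(ACTIONS, pressure))
```

On this reading, raising or lowering the penalty in total_evaluator changes numbers but not the return type, which is one way to picture the claim that scale does not convert a total evaluator into a partial one.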

Evolution Notes

Explicitly engages with mainstream alignment research communities
Frames Axionic work as precondition layer, not replacement
Addresses common misreadings systematically
Establishes relationship to doom arguments without dismissing them
Clarifies scope: not anthropomorphic, applies to reflective systems

Cross-References

Open Questions

Can the framework productively engage with RL/RLHF-based approaches, or is the gap unbridgeable?
How do we communicate the constitutive vs. substantive distinction to practitioners focused on immediate deployment?
What would constitute empirical evidence that existing systems do/don’t satisfy constitutive requirements?
Can training-centric and architectural approaches be synthesized, or are they fundamentally incompatible?
How much doom risk remains even if constitutive alignment is solved?
