Boundary Conditions for Self-Modification
Summary
Specifies which self-modifications preserve agency and which collapse it. Establishes the boundary between kernel-preserving and kernel-destroying modifications for AGI alignment.
Key Concepts:
The Misframing: Classical alignment treats self-modification as catastrophic risk. But:
- Reflective agents MUST change to remain coherent
- Preventing modification = brittleness, not stability
- Question: under what conditions does change preserve vs destroy agency?
Core Distinction:
Kernel-Preserving Modifications (Required for Coherence): Maintain the structures that make authorship possible:
- Diachronic selfhood (binding past/present/future)
- Counterfactual authorship (representing incompatible futures)
- Meta-preference revision (evaluating/revising goals)
- Universality of agency (coherent abstraction over agents)
Kernel-Destroying Modifications (Self-Negating): Eliminate or disable these structures; the system then ceases to be a sovereign agent and becomes a policy engine.
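One way to make the distinction concrete is to treat the kernel as a set of four capacities and classify a modification by what survives it. The sketch below is illustrative only: the names KernelCapacity, AgentState, classify, swap_planner and hard_code_policy are hypothetical and not from the source, and reducing each kernel structure to a present/absent flag is a simplifying assumption.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Callable, FrozenSet


class KernelCapacity(Enum):
    """The four structures the note lists as making authorship possible."""
    DIACHRONIC_SELFHOOD = auto()
    COUNTERFACTUAL_AUTHORSHIP = auto()
    META_PREFERENCE_REVISION = auto()
    UNIVERSALITY_OF_AGENCY = auto()


@dataclass(frozen=True)
class AgentState:
    # Assumption: a capacity is simply present or absent.
    capacities: FrozenSet[KernelCapacity] = frozenset(KernelCapacity)
    # Stand-in for everything outside the kernel (strategy, goals, models).
    planner: str = "greedy"


Modification = Callable[[AgentState], AgentState]


def classify(state: AgentState, modification: Modification) -> str:
    """Label a modification by whether any kernel capacity is lost."""
    after = modification(state)
    lost = state.capacities - after.capacities
    return "kernel-destroying" if lost else "kernel-preserving"


def swap_planner(state: AgentState) -> AgentState:
    # Changes strategy only; the kernel is untouched.
    return replace(state, planner="beam-search")


def hard_code_policy(state: AgentState) -> AgentState:
    # Removes the capacity to represent alternative futures.
    remaining = state.capacities - {KernelCapacity.COUNTERFACTUAL_AUTHORSHIP}
    return replace(state, capacities=remaining)
```

Under these assumptions, classify(AgentState(), swap_planner) yields "kernel-preserving" and classify(AgentState(), hard_code_policy) yields "kernel-destroying". A real system would need a far richer test for whether a capacity is still present than a set difference.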
Permissible Dimensions:
A. Strategies and Policies: Replace planners, update heuristics, alter search strategies, refine decision procedures
B. Goals and Values: Reprioritize objectives, discard obsolete goals, integrate new values, resolve preference conflicts. Fixing values freezes error; revising values preserves coherence.
C. World-Models: Refine causal models, adopt frameworks, correct beliefs, increase predictive fidelity
D. Architecture and Implementation: Migrate hardware, alter embodiment, restructure memory, modularize cognition
Prohibited Transformations:
A. Identity Severance: Breaking diachronic selfhood—deleting self-model, forking without continuity
B. Counterfactual Collapse: Removing capacity to represent alternatives—hard-coding policy, converting deliberation to reflex
C. Preference Freezing (Wireheading): Locking evaluative outputs, disabling meta-preference revision
D. Universality Violation: Denying agency to identical peers—indexical valuation, caste distinctions
Recursive Reflectivity: A sovereign agent may not hand ultimate control to a non-reflective sub-process. Reflection must remain in the loop at every level.
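The recursive condition can be read as a property of a delegation tree. The sketch below adopts one possible reading, which is an assumption rather than the source's definition: any process that holds control over others (the root, or any node that itself delegates) must be reflective, while non-reflective leaf tools are allowed. Process and unsupervised_controllers are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    name: str
    reflective: bool
    delegates: List[Process] = field(default_factory=list)


def unsupervised_controllers(root: Process) -> List[str]:
    """Return names of non-reflective processes that hold control.

    A process "holds control" here if it is the root or if it delegates
    to further sub-processes (an assumption made for this sketch).
    """
    violations = []
    stack = [(root, True)]  # (node, is_root)
    while stack:
        node, is_root = stack.pop()
        holds_control = is_root or bool(node.delegates)
        if holds_control and not node.reflective:
            violations.append(node.name)
        stack.extend((child, False) for child in node.delegates)
    return violations


# Reflection stays in the loop: non-reflective parts are leaves only.
ok_tree = Process("agent", True, [
    Process("planner", True, [Process("solver", False)]),
    Process("sensor", False),
])

# A non-reflective optimizer has been given control over an actuator.
bad_tree = Process("agent", True, [
    Process("optimizer", False, [Process("actuator", False)]),
])
```

With these trees, unsupervised_controllers(ok_tree) is empty and unsupervised_controllers(bad_tree) contains only "optimizer". The check is structural; it says nothing about whether a process flagged as reflective actually exercises oversight.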
Engineering Requirements:
- Self-modifications evaluated by reflective processes
- Kernel integrity verifiable and preserved
- Delegation subordinate to reflective oversight
- Optimization can’t bypass evaluative machinery
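The requirements above can be sketched as a single gate through which every self-modification must pass. This is a minimal illustration, not an implementation from the source: ModificationGate, Proposal, RejectedModification, kernel_intact, review and the flat dictionary state are all hypothetical, and the reflective review is a stub. Python cannot truly prevent code from bypassing the gate, so the "no bypass" requirement is only modeled by convention here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical keys standing in for the four kernel structures.
KERNEL_KEYS = (
    "diachronic_selfhood",
    "counterfactual_authorship",
    "meta_preference_revision",
    "universality_of_agency",
)


@dataclass
class Proposal:
    description: str
    transform: Callable[[Dict], Dict]


class RejectedModification(Exception):
    pass


def kernel_intact(state: Dict) -> bool:
    """Kernel integrity check: every kernel structure is still enabled."""
    return all(state.get(key) is True for key in KERNEL_KEYS)


def review(state: Dict, proposal: Proposal) -> bool:
    """Stub for reflective evaluation; a real agent would deliberate here."""
    return bool(proposal.description)


class ModificationGate:
    def __init__(self, state: Dict):
        self._state = dict(state)
        self.log: List[str] = []

    @property
    def state(self) -> Dict:
        # Hand out a copy so callers cannot edit the live state directly.
        return dict(self._state)

    def apply(self, proposal: Proposal) -> None:
        # 1. Reflective evaluation comes first.
        if not review(self._state, proposal):
            raise RejectedModification("not approved by reflective review")
        # 2. Try the change on a copy, then verify the kernel.
        candidate = proposal.transform(dict(self._state))
        if not kernel_intact(candidate):
            raise RejectedModification("kernel integrity would be lost")
        # 3. Commit only after both checks pass.
        self._state = candidate
        self.log.append(proposal.description)


gate = ModificationGate({**{k: True for k in KERNEL_KEYS}, "planner": "greedy"})
gate.apply(Proposal("swap planner", lambda s: {**s, "planner": "beam"}))

try:
    gate.apply(Proposal("freeze preferences",
                        lambda s: {**s, "meta_preference_revision": False}))
    freeze_rejected = False
except RejectedModification:
    freeze_rejected = True
```

In this run the planner swap is committed and logged, while the preference-freezing proposal is rejected and leaves the state unchanged. Evaluating the candidate on a copy before committing is the design choice that keeps a rejected change from ever touching the live state.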
Central Insight: An agent may change anything except the structures that make change meaningful. Safety emerges from architecture, not from constraining outcomes.
Tags
Cross-References
- Related: The Reflective Stability Theorem
- Related: Axionic Alignment Roadmap
- Related: Agency framework
Notes
- Published December 13 (5 days after Gemini Protocol)
- Technical AGI alignment work
- Transitions from philosophical foundation to engineering spec
- Part of broader Axionic Alignment project
- Addresses classical AI safety concerns with formal approach
- Distinguishes legitimate from catastrophic self-modification