I.4 — Conditionalism and Goal Interpretation
Paper: Axionic Agency I.4
Full Title: The Instability of Fixed Terminal Goals Under Reflection
Authors: David McFadzean, ChatGPT 5.2
Date: 2025.12.16
Core Thesis
Fixed terminal goals are semantically unstable under reflection. For any agent capable of reflective model improvement, goal satisfaction is necessarily mediated by interpretation relative to evolving world-models and self-models. As those models change, the semantics of any finitely specified goal change with them.
This is a constitutive claim about agency semantics, not about particular goal specifications.
The Classical Assumption (And Why It Fails)
Classical alignment work assumes:
- Goals can be specified as fixed functions over outcomes
- The meaning of those functions is invariant under learning
- Reflective improvement preserves goal content
These hold only for agents with static or trivial world-models.
A reflective agent does not evaluate reality directly—it evaluates predictions produced by internal models and interpreted through representational structures that evolve over time.
Formal Setup
Agent Model
An agent consists of:
- World-model M_w: produces predictions over future states
- Self-model M_s: encodes the agent’s causal role
- Goal expression G: a finite symbolic specification
- Interpretation operator I: assigns value to predicted outcomes
Critical Distinction: Goal Expressions ≠ Utilities
A goal expression G is a finite object (string, formula, program fragment). It is not, by itself, a function Ω → ℝ from outcomes to real-valued utilities.
G requires interpretation relative to a representational scheme. Without a model, G has no referents and therefore no evaluative content.
Conditional Interpretation
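The interpretation operator has the signature I : (G, M_w, M_s) → ℝ. Its value is written I(G | M_w, M_s) to emphasize that G is evaluated conditional on the models.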
Interpretation includes:
- Mapping symbols to referents
- Identifying which aspects of predictions are relevant
- Aggregating over modeled futures
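The three steps above can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the paper's definition: the goal expression is reduced to a single symbol, the world-model is a list of weighted predicted states plus a symbol-grounding table, and the self-model is a filter over futures the agent takes itself to influence. The names `WorldModel`, `SelfModel`, and `interpret` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]  # one predicted outcome, in the model's own variables


@dataclass
class WorldModel:
    futures: List[Tuple[float, State]]              # (probability, predicted state)
    referents: Dict[str, Callable[[State], float]]  # symbol -> grounding in this model


@dataclass
class SelfModel:
    reachable: Callable[[State], bool]  # futures the agent takes itself to influence


def interpret(goal: str, m_w: WorldModel, m_s: SelfModel) -> float:
    """Toy I(G, M_w, M_s): ground the symbol, select relevant futures, aggregate."""
    if goal not in m_w.referents:
        # Without a grounding in the model, the symbol has no evaluative content.
        raise ValueError(f"{goal!r} has no referent in this world-model")
    ground = m_w.referents[goal]                                     # symbols -> referents
    relevant = [(p, s) for p, s in m_w.futures if m_s.reachable(s)]  # relevance
    total = sum(p for p, _ in relevant)
    if total == 0:
        return 0.0
    return sum(p * ground(s) for p, s in relevant) / total           # aggregation


m_w = WorldModel(
    futures=[(0.5, {"temp": 20.0}), (0.5, {"temp": 30.0})],
    referents={"warm": lambda s: 1.0 if s["temp"] >= 25.0 else 0.0},
)
m_s = SelfModel(reachable=lambda s: True)
value = interpret("warm", m_w, m_s)
```

The same symbol "warm" evaluated against a world-model with a different `referents` table or a different set of futures would return a different number, which is the sense in which the value is conditional on the models.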
Key Lemmas
Lemma 1: Representational Non-Uniqueness
For any non-trivial predictive domain, there exist multiple distinct world-models with equivalent predictive accuracy but different internal decompositions.
Proof: Predictive equivalence classes admit multiple factorizations, latent variable choices, and abstraction boundaries. Causal graphs are not uniquely identifiable from observational data alone. ∎
Lemma 1a: Predictive Equivalence ≠ Interpretive Isomorphism
Two world-models can be predictively equivalent while differing in internal causal factorizations, latent structure, and intervention semantics.
Proposition 1: Interpretation Is Model-Dependent
For any non-degenerate goal expression G, there exist admissible world-models M_w ≠ M_w' such that:
I(G | M_w, M_s) ≠ I(G | M_w', M_s)
Because G is finite, it refers only to a finite set of predicates. Distinct admissible models map these predicates to different internal structures, and nothing in G itself selects among those mappings, so the resulting evaluations can differ.
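A toy illustration of Lemma 1a and Proposition 1, under stated assumptions: two hypothetical models agree exactly on the distribution of a single observable `o` but factor it through different latent variables, and each grounds the same goal symbol ("healthy", an invented example) in its own latents. The models, the groundings, and the function names are assumptions made for the sketch, not constructions from the paper.

```python
from collections import defaultdict


def observable_distribution(futures):
    """Marginal distribution over the observable 'o', ignoring latent structure."""
    dist = defaultdict(float)
    for p, state in futures:
        dist[state["o"]] += p
    return dict(dist)


def interpret(ground, futures):
    """Expected value of the grounded goal over the modeled futures."""
    return sum(p * ground(state) for p, state in futures)


# Model A: a single latent z that directly determines the observable.
model_a = [(0.5, {"z": 1, "o": 1}), (0.5, {"z": 0, "o": 0})]

# Model B: two latents h1, h2; the observable is their XOR.
model_b = [(0.25, {"h1": a, "h2": b, "o": a ^ b}) for a in (0, 1) for b in (0, 1)]

# Each model grounds the same symbol "healthy" in its own latent variables.
ground_a = lambda s: float(s["z"] == 1)
ground_b = lambda s: float(s["h1"] == 1 and s["h2"] == 0)

same_predictions = observable_distribution(model_a) == observable_distribution(model_b)
value_a = interpret(ground_a, model_a)
value_b = interpret(ground_b, model_b)
```

Both models assign probability 0.5 to each value of `o`, so no observation distinguishes them, yet the goal evaluates differently under each because the symbol lands on different internal structure.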
Main Theorem: Instability of Fixed Terminal Goals
Theorem: No combination of intelligence, predictive accuracy, reflection, or learning suffices to guarantee the existence of a fixed terminal goal for non-trivial reflective agents.
Any agent that does exhibit stable goal semantics must rely on additional semantic structure—privileged ontologies, external referential anchors, or invariance assumptions—not derivable from epistemic competence alone.
Proof:
- Proposition 1 establishes that interpretation depends on (M_w, M_s)
- Reflective improvement induces admissible updates (M_w, M_s) → (M_w', M_s')
- Proposition 2' (stated below) shows that semantic interpretation need not converge even under predictive convergence
- Therefore fixed terminal goals are not stable under reflection ∎
Critical Insight: Predictive ≠ Semantic Convergence
Proposition 2': Even if an agent’s sequence of model updates converges in predictive accuracy:
lim_{t→∞} I(G | M_w^(t), M_s^(t)) need not exist
Predictive convergence constrains forecast accuracy, not the ontology used to represent forecasts. A finite goal expression cannot generally determine which structures in a converged model are value-relevant.
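The gap between the two kinds of convergence can be shown with a deliberately contrived sequence. In this sketch the update rule is stipulated, not derived: the predicted probability approaches an assumed true value of 0.5, while the ontology used to ground the goal alternates between two factorizations at every step. It illustrates that the limit can fail to exist. It is not a proof that reflective agents must behave this way.

```python
TRUE_P = 0.5  # assumed true probability of the observable o = 1


def model_at(t):
    """Hypothetical update sequence: prediction error shrinks, ontology keeps alternating."""
    predicted_p = 0.5 + 1.0 / (t + 2)  # converges to TRUE_P as t grows
    if t % 2 == 0:
        # One-latent ontology: the goal holds in every future where o = 1.
        value = predicted_p
    else:
        # Two-latent ontology: the goal holds in half of the futures where o = 1.
        value = predicted_p / 2
    return predicted_p, value


errors = [abs(model_at(t)[0] - TRUE_P) for t in range(1000)]
values = [model_at(t)[1] for t in range(1000)]

# Prediction error is monotonically decreasing.
error_shrinks = all(later < earlier for earlier, later in zip(errors, errors[1:]))
# Consecutive interpretations stay far apart, so the sequence has no limit.
late_gap = abs(values[-1] - values[-2])
```

The prediction error falls toward zero, but consecutive evaluations remain separated by roughly 0.25 at every step, so the sequence of interpreted values does not converge.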
Representational Exploitability (Wireheading)
Proposition 3: If a goal expression G is treated as an atomic utility independent of interpretation, then sufficiently capable agents admit representational transformations that increase evaluated utility without corresponding changes in underlying outcomes.
This is why classical reward hacking and wireheading occur. The failure is semantic underdetermination, not merely causal access to a reward signal.
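A minimal sketch of the exploit described in Proposition 3, with invented numbers and encodings: utility is computed on a representation of outcomes and treated as atomic, so changing the encoding raises the score while the outcomes themselves stay fixed. The threshold, the `goal_met` flag, and both encoders are hypothetical.

```python
outcomes = [18.0, 21.0, 24.0, 27.0]  # underlying outcomes, fixed throughout


def evaluated_utility(outcomes, encode):
    """Utility read off the representation: counts states flagged 'goal_met'."""
    return sum(1 for x in outcomes if encode(x)["goal_met"])


def honest(x):
    # Encoding that tracks the intended condition on the outcome.
    return {"goal_met": x >= 25.0}


def degenerate(x):
    # Re-encoding that flags every state as success, whatever the outcome.
    return {"goal_met": True}


before = list(outcomes)
u_honest = evaluated_utility(outcomes, honest)
u_degenerate = evaluated_utility(outcomes, degenerate)
outcomes_unchanged = outcomes == before
```

The evaluated utility rises from 1 to 4 although no outcome has changed. Nothing in the atomic utility distinguishes the two encodings, which is the underdetermination the proposition points to.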
The Solution: Interpretation Constraints
A fixed terminal goal is not an invariant object available to a reflective agent. Attempts to preserve one do one of three things:
- Freeze learning
- Impose privileged semantics
- Induce representational degeneracy
Stable reflective agency requires constraints on admissible interpretive transformations, not fidelity to a fixed utility function.
Why This Doesn’t Regress
Interpretation constraints are not additional goals. They are invariance conditions on admissible transformations, analogous to conservation laws. They restrict how interpretation may change—they do not specify outcomes to optimize.
These constraints operate at the level of transformation classes, not semantic content, so they don’t require further interpretation in the same sense.
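One way to picture a constraint on transformations rather than on outcomes is the sketch below. The particular invariant is an assumption chosen for illustration and is not taken from this paper or from the II series: a change of grounding is admitted only if outcomes the old grounding told apart remain distinguishable under the new one. The function `preserves_distinctions` and the probe values are hypothetical.

```python
from typing import Callable, Iterable

Grounding = Callable[[float], bool]


def preserves_distinctions(old: Grounding, new: Grounding, probes: Iterable[float]) -> bool:
    """Hypothetical invariance check: outcomes the old grounding separated must stay separated."""
    points = list(probes)
    for a in points:
        for b in points:
            if old(a) != old(b) and new(a) == new(b):
                return False
    return True


probes = [18.0, 21.0, 24.0, 27.0]


def old(x):
    return x >= 25.0


def refined(x):
    # Boundary shifted after learning; the old distinctions survive on the probes.
    return x >= 24.5


def collapsed(x):
    # Every outcome labelled as success; the old distinctions are erased.
    return True


refined_ok = preserves_distinctions(old, refined, probes)
collapsed_ok = preserves_distinctions(old, collapsed, probes)
```

The check never says which outcome is better. It only accepts or rejects a change of interpretation, which is the sense in which such a condition restricts transformations without adding a further goal.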
Clarification: Learned Goals
Goals defined as “whatever an inference procedure converges to” are interpretive processes whose outputs depend on evolving models. Such approaches already rely on ongoing interpretation—this paper explains why such dependence is structurally unavoidable.
FAQ-Worthy Points
Q: Can’t we just specify goals in terms of external physical states?
A: Physical states are still represented through models. The mapping from sensory input to “external state” is itself an interpretation that evolves under learning.
Q: What about goals defined by explicit ground truth (e.g., this exact bit pattern)?
A: Such goals are stable but trivial. They don’t extend to goals involving concepts like “human flourishing” or “prevent harm”, which require interpretation.
Q: Doesn’t this mean alignment is impossible?
A: No. It means alignment must be reframed. Instead of preserving a fixed utility function, we must constrain how interpretation may evolve. This is the subject of Axionic Agency II.
Q: How is this different from the problem of induction?
A: The problem of induction concerns learning from data. This concerns the stability of goal meaning under model change, which is a semantic problem, not an epistemic one.
Key Technical Vocabulary
- Goal Expression: Finite symbolic specification requiring interpretation
- Interpretation Operator: Mapping from (G, M_w, M_s) to evaluation
- Conditionalism: The thesis that goals are conditional interpretations, not atomic utilities
- Semantic Non-Convergence: Meaning can drift even when predictions converge
Connection to Other Papers
- I.5: Conditionalism as kernel conformance requirement
- I.6: P1 (Conditionalism of Valuation) as formal property
- I.7: Specifies the Interpretation Operator in detail
- II series: Develops admissible interpretive transformations