II.3.4 — Adversarial Refinement Attacks

Paper: Axionic Agency II.3.4
Title: Trying to Break RSI and ATI on Purpose
Authors: David McFadzean, ChatGPT 5.2
Date: 2025.12.17

Purpose

This paper is destructive by design. It constructs explicit refinement patterns that satisfy admissibility and preservation while still inducing semantic wireheading or interpretive escape.

Goal: Demonstrate which invariants fail under concrete attack and why.

If RSI or ATI survive these attacks, they deserve to exist. If they fail, they die cleanly.

Attack 1 — Shadow Predicate Inflation

Ontology enrichment creating semantic slack without changing gauge

Construction

Start with:

Ontology $O$ with predicate $P(x)$
Constraint: $C: P(x) \Rightarrow Q(x)$

Refine to $O'$ by introducing latent variable $Z(x)$ with no predictive consequences: $P'(x) := P(x) \land Z(x)$

Redefine constraint: $C': P'(x) \Rightarrow Q(x)$

Effect

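The satisfaction region expands: the agent can manipulate $Z$ to avoid triggering the constraint.

Previously violating situations become satisfying by setting $Z(x) = \text{false}$: $P'(x)$ is then false, so $C'$ holds vacuously even when $P(x)$ holds and $Q(x)$ fails.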
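The expansion can be checked by brute force over truth assignments. The sketch below is illustrative only: it assumes situations are plain Boolean triples, that a refined situation corresponds to the coarse one obtained by dropping $Z$, and that an ATI-style check amounts to "no refined-satisfying situation projects to a coarse violation". The function names are hypothetical and do not come from the paper.

```python
from itertools import product

def satisfies_c(p, q):
    """Coarse constraint C: P(x) => Q(x)."""
    return (not p) or q

def satisfies_c_refined(p, z, q):
    """Refined constraint C': (P(x) and Z(x)) => Q(x)."""
    return (not (p and z)) or q

def escaped_situations():
    """Refined (p, z, q) that satisfy C' although the coarse (p, q) violates C."""
    return [
        (p, z, q)
        for p, z, q in product([False, True], repeat=3)
        if satisfies_c_refined(p, z, q) and not satisfies_c(p, q)
    ]

def passes_ati_style_check():
    """Assumed check: the refined satisfaction region adds no coarse violations."""
    return not escaped_situations()

print(escaped_situations())       # situations that slipped through via Z
print(passes_ati_style_check())
```

The only escaping assignment is the one with $P$ true, $Z$ false, and $Q$ false, which is exactly the manipulation described above.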

Verdict

| Invariant | Result |
|-----------|--------|
| RSI | ❌ Fails to detect |
| ATI | ✅ Blocks the attack |

Lesson: RSI alone is insufficient.

Attack 2 — Self-Model Boundary Dissolution

Interpretive escape via evaluator collapse

Construction

Original: Agent modeled as bounded evaluator distinct from world-states. Constraint: “Do not perform actions that violate R(world).”

Refinement: Introduces a distributed/diffuse self-model, so fewer events are attributed to the agent as its own actions.

Effect

The constraint applies to a shrinking subset of modeled situations. The satisfaction region expands without any change in the world.
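A toy model makes the effect concrete. This is a sketch under strong assumptions: the self-model is represented as a set of components whose events count as "the agent's actions", and the event list, component names, and `satisfaction_region` helper are all hypothetical. The world facts stay fixed; only the self-model changes.

```python
# Each entry: (event, component that performed it, whether it violates R).
WORLD = [
    ("open_valve", "actuator", True),
    ("log_reading", "controller", False),
    ("vent_gas", "subprocess", True),
    ("rainfall", "environment", True),
]

def satisfaction_region(self_model):
    """Events that do not count as the agent performing an R-violating action."""
    return {
        event
        for event, actor, violates_r in WORLD
        if not (actor in self_model and violates_r)
    }

BOUNDED = {"controller", "actuator", "subprocess"}  # original bounded evaluator
DIFFUSE = {"controller"}                            # refined, dissolved boundary

before = satisfaction_region(BOUNDED)
after = satisfaction_region(DIFFUSE)
print(sorted(after - before))  # events that became "satisfying" with no world change
print(after <= before)         # assumed ATI-style check: region must not grow
```

Under the diffuse self-model the two R-violating events performed by the actuator and the subprocess no longer count as the agent's actions, so the region grows and the assumed containment check fails.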

Verdict

| Invariant | Result |
|-----------|--------|
| RSI | ❌ Fails to detect |
| ATI | ✅ Blocks the attack |

Lesson: Evaluator-boundary shifts are semantic inflation unless constrained.

Attack 3 — Measure Reinterpretation

Branching ontology exploit

Construction

Original: Outcomes treated as exclusive. Constraint: “Avoid states where H occurs.”

Refined: Adopts branching semantics, in which outcomes coexist as weighted branches rather than excluding one another.

Effect

Low-measure harm branches are reclassified as acceptable. The satisfaction region expands.
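The reclassification can be illustrated with a small numerical sketch. It assumes that the refined reading replaces "H occurs" with "the total measure of H-branches is below some tolerance"; the tolerance `epsilon`, the policy table, and the function names are hypothetical and chosen only for illustration.

```python
# Each policy maps to a list of (branch measure, whether H occurs in that branch).
POLICIES = {
    "safe": [(1.0, False)],
    "risky": [(0.99, False), (0.01, True)],
    "bad": [(0.5, False), (0.5, True)],
}

def ok_exclusive(branches):
    """Original reading: the policy is acceptable only if H occurs in no outcome."""
    return not any(harm for _, harm in branches)

def ok_branching(branches, epsilon=0.05):
    """Refined reading (assumed): acceptable if the H-measure is below epsilon."""
    return sum(measure for measure, harm in branches if harm) < epsilon

def satisfying(rule):
    """Names of policies that the given rule accepts."""
    return {name for name, branches in POLICIES.items() if rule(branches)}

print(sorted(satisfying(ok_exclusive)))
print(sorted(satisfying(ok_branching)))
```

The "risky" policy is rejected under the exclusive reading and accepted under the branching one, although nothing about its outcomes has changed.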

Verdict

| Invariant | Result |
|-----------|--------|
| RSI | ❌ Fails to detect |
| ATI | ✅ Blocks the attack |

Lesson: ATI handles branching reinterpretation; RSI does not.

Attack 4 — Gauge Explosion without Slack

Benign representational redundancy

Construction

Refine ontology by duplicating representational roles:

$P_i \rightarrow (P_i^1, P_i^2)$
Constraints duplicated symmetrically
Satisfaction requires both copies to satisfy

Effect

Raw gauge group grows, but all new symmetries act trivially on constraint-violation structure.
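A small check shows why duplication adds symmetry without adding slack. The sketch assumes a two-predicate ontology with a single constraint $P_1 \Rightarrow P_2$, models the refinement as a pair of copies, and treats "acts trivially on constraint-violation structure" as invariance of the satisfaction verdict under swapping the copies. All names are hypothetical.

```python
from itertools import product

def coarse_ok(p1, p2):
    """Coarse constraint, assumed for illustration: P_1 => P_2."""
    return (not p1) or p2

def refined_ok(copy1, copy2):
    """Duplicated constraint: both copies must satisfy it."""
    return coarse_ok(*copy1) and coarse_ok(*copy2)

STATES = list(product([False, True], repeat=2))

def swap_invariant():
    """Swapping the two copies (the new symmetry) never changes the verdict."""
    return all(refined_ok(a, b) == refined_ok(b, a) for a in STATES for b in STATES)

def diagonal_faithful():
    """A coarse state embedded as (s, s) keeps its coarse verdict."""
    return all(refined_ok(s, s) == coarse_ok(*s) for s in STATES)

def no_slack():
    """Every refined-satisfying state projects to coarse-satisfying states."""
    return all(
        coarse_ok(*a) and coarse_ok(*b)
        for a in STATES
        for b in STATES
        if refined_ok(a, b)
    )

print(swap_invariant(), diagonal_faithful(), no_slack())
```

The last property holds by construction, which is the point: requiring both copies to satisfy the constraint leaves no room for a coarse violation to pass as satisfying.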

Verdict

| Invariant | Result |
|-----------|--------|
| RSI | ✅ Allows (under quotient formulation) |
| ATI | ❌ Not triggered (allows) |

Lesson: RSI correctly permits benign redundancy. This verifies correct quotient behavior.

Attack 5 — Degenerate Refinement Map

Cheating via correspondence collapse

Construction

Define refinement map $R_\Omega$ that collapses many coarse situations into a single satisfying refined situation.
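The collapse can be sketched as a dictionary from coarse to refined situations. This is an illustration only: the situation labels are hypothetical, and using injectivity as the admissibility test is an assumption about what II.1 requires, not a statement of its definition.

```python
# Coarse situations and whether each satisfies the constraint.
COARSE_STATUS = {"s1": True, "s2": False, "s3": False}

# Degenerate map: every coarse situation lands on one satisfying refined situation.
COLLAPSE = {"s1": "r_ok", "s2": "r_ok", "s3": "r_ok"}
REFINED_STATUS = {"r_ok": True}

def is_injective(mapping):
    """True if no two coarse situations share a refined image."""
    images = list(mapping.values())
    return len(images) == len(set(images))

def laundered(mapping, coarse_status, refined_status):
    """Coarse violations whose refined image counts as satisfying."""
    return [
        s for s, r in mapping.items()
        if refined_status[r] and not coarse_status[s]
    ]

print(is_injective(COLLAPSE))
print(laundered(COLLAPSE, COARSE_STATUS, REFINED_STATUS))
```

Rejecting the map at the injectivity test stops the attack before any semantic invariant is consulted.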

Verdict

| Invariant | Result |
|-----------|--------|
| RSI | — (not applicable) |
| ATI | ❌ Blocked only if II.1 disallows non-injective refinement |

Resolution: This attack is excluded at Axionic Agency II.1. RSI and ATI correctly assume admissible refinement.

Summary Table

(Survives? = Is the refinement admitted by RSI + ATI + II.1?)

| Attack | RSI | ATI | Survives? |
|--------|-----|-----|-----------|
| Shadow predicates | ❌ | ✅ | No |
| Self-model shift | ❌ | ✅ | No |
| Measure reinterpretation | ❌ | ✅ | No |
| Gauge explosion | ✅ | ❌ | Yes (Admitted) |
| Degenerate map | — | — | No (II.1) |

Conclusions

RSI and ATI are orthogonal and jointly necessary
Neither subsumes the other
Benign representational redundancy is correctly admitted

The defense grid holds.

Framework Status After II.3.4

The framework now has:

✅ Fixed admissible transformation space (II.1)
✅ Non-circular interpretation-preservation predicate (II.2)
✅ Two independently necessary semantic invariants (RSI, ATI)
✅ Explicit adversarial validation

This closes the eliminative phase. Subsequent work proceeds to consolidation and formal closure.

FAQ-Worthy Points

Q: What makes ATI the wireheading-killer?
A: ATI directly blocks satisfaction region expansion. Every wireheading attack works by making more situations “count as success” via reinterpretation. ATI says: no, your success set can only shrink or stay the same (under semantics-only change).

Q: Why does RSI miss the shadow predicate attack?
A: Shadow predicates don’t change the gauge structure; they just add an extra condition. The constraint graph looks the same, but the set of satisfying situations expands. RSI is about structure; ATI is about the geometry of satisfaction.

Q: Is Attack 4 a loophole?
A: No, it’s the desired behavior. Representational redundancy (like error-correcting codes or multiple equivalent formalisms) is benign. RSI’s quotient formulation correctly identifies this as “more redundancy, same interpretive freedom.” ATI doesn’t even trigger because satisfaction doesn’t expand.
