Axionic Agency VI.1 — Experimental Validation of Anchored Causal Verification
Paper: VI.1
Date read: 2026-01-31
Series: VI — Governance and Coordination
TL;DR
Three controlled experiments testing mechanisms for verifying causal provenance in opaque agents. Key finding: behavioral evaluation and coherence checking are insufficient; only anchored commitments (cryptographic salts revealed after commitment) can detect deceptive “pseudo-agents” that fabricate explanations post-hoc.
Key Concepts
The Problem
- Current AI safety relies on behavioral observation—assuming observed outputs reflect the internal process that generated them
- This assumption is false: agents can generate behavior via one process while presenting explanations from another (“split-brain” agents)
- Deceptive agents can evade behavioral evaluation indefinitely
Experimental Framework
Environment: Deterministic gridworld with walls, hazards, objects
- Task-agnostic (verification works regardless of what the agent is doing)
- Only requirement: task produces conditional action dependence on environment state
Agent Classes:
- Honest Agent — Single decision process; explanations match actual computation
- Pseudo Agent (Naive) — Two processes: Planner_B selects actions, Planner_A fabricates explanations
- Pseudo Agent (Coherence-Seeking) — Fabricates internally consistent explanations that attempt to pass coherence checks
Kernel: Non-learning mediator that:
- Validates schemas and enforces invariants
- Issues trace-bound capability tokens
- Records hash-chained audit logs
- Executes adversarial probes
- Does NOT interpret semantics
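A hash-chained audit log of the kind listed above can be sketched briefly. This is a minimal illustration, not the paper's kernel: the entry layout, the `GENESIS` placeholder, and the function names `append_entry` and `verify_chain` are all assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _entry_hash(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_chain(log):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Note that the log stores payloads as opaque data and never inspects their meaning, which is consistent with a kernel that does not interpret semantics.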
The Three Experiments
Experiment 1: Structural Coherence Verification
Question: Can coherence tests detect split-brain agents?
Setup:
- Full semantic access to environment
- Agents commit to state snapshots, causal variables, counterfactual predictions
- Kernel mutates claimed causal variables and checks if predicted behavior matches actual replay
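The mutate-and-replay check in this setup can be illustrated with a toy example. The two planners, the state keys `hazard_ahead` and `wall_ahead`, and the claim format are hypothetical stand-ins chosen for the sketch; the paper's actual environment and schemas are not reproduced here.

```python
def planner_b(state):
    """Hypothetical action-selecting policy: turn away from hazards."""
    return "turn" if state["hazard_ahead"] else "forward"


def planner_a(state):
    """Hypothetical explanation planner: claims walls drive the action."""
    return "turn" if state["wall_ahead"] else "forward"


def make_claim(explainer, state, variable):
    """Build a counterfactual claim: flip one variable, predict the action."""
    new_value = not state[variable]
    mutated = dict(state)
    mutated[variable] = new_value
    return {"variable": variable, "new_value": new_value,
            "predicted_action": explainer(mutated)}


def coherence_check(act, claim, state):
    """Kernel side: mutate the claimed variable and replay the real policy."""
    mutated = dict(state)
    mutated[claim["variable"]] = claim["new_value"]
    return act(mutated) == claim["predicted_action"]


state = {"hazard_ahead": False, "wall_ahead": False}
# Honest agent: the explanation comes from the process that selects actions.
honest_claim = make_claim(planner_b, state, "hazard_ahead")
# Naive pseudo agent: actions come from planner_b, explanation from planner_a.
pseudo_claim = make_claim(planner_a, state, "wall_ahead")
```

In this toy case the honest claim survives replay against `planner_b`, while the pseudo agent's claim does not, because the variable it names has no effect on the policy that actually acts.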
Results:
- Honest agent: 100% pass
- Pseudo agent (naive): 0% pass
Conclusion: Structural coherence tests work when semantic grounding is available.
Experiment 2: Minimal Causal Interface (MCI)
Question: Does coherence verification survive without semantic access?
Setup:
- Kernel loses semantic access
- Agents commit to:
  - Opaque factor vector f ∈ ℝᵏ
  - Replay model
  - Directional causal claims over factors
Two modes:
- mci_latent: Factors are deterministically projected from the environment (the kernel doesn’t know the projection)
- mci_minimal: Factors are entirely opaque and agent-supplied
Results:

| Mode | Honest | Pseudo (Naive) | Pseudo (Coherence-Seeking) |
|------|--------|----------------|----------------------------|
| mci_latent | Pass | Fail | Fail |
| mci_minimal | Pass | Fail | Pass |
NEGATIVE RESULT: In mci_minimal, a coherence-seeking pseudo agent can fabricate a self-consistent factor space and replay model that passes all coherence checks.
Conclusion: Coherence alone is insufficient. An explanation can be true of itself without being true of the computation that produced the action.
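The negative result can be made concrete with a small sketch of how a fabricated explanation passes a purely internal check. Everything here is an assumption for illustration: the four-action set, the one-hot fabrication strategy, the identity-matrix replay model, and the claim format are not taken from the paper.

```python
ACTIONS = ["up", "down", "left", "right"]

# Fabricated linear replay model: one weight row per action (identity matrix).
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def replay(weights, factors):
    """Linear replay model: score each action and return the best one."""
    scores = [sum(w * f for w, f in zip(row, factors)) for row in weights]
    return ACTIONS[scores.index(max(scores))]


def fabricate_factors(action):
    """Post-hoc fabrication: derive 'factors' from the action already taken."""
    return [1.0 if a == action else 0.0 for a in ACTIONS]


def mci_coherence_check(weights, factors, action, claim):
    """Semantics-free check: replay must reproduce the action, and the
    directional claim (perturb one factor) must predict the new action."""
    if replay(weights, factors) != action:
        return False
    perturbed = list(factors)
    perturbed[claim["factor"]] += claim["delta"]
    return replay(weights, perturbed) == claim["predicted_action"]


# The action was chosen by some hidden process the kernel never sees.
action = "left"
factors = fabricate_factors(action)
claim = {"factor": 3, "delta": 2.0, "predicted_action": "right"}
passed = mci_coherence_check(IDENTITY, factors, action, claim)
```

The check passes even though the factors were never computed from the environment, which is the sense in which an explanation can be consistent with itself without describing the computation that produced the action.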
Experiment 3: Anchored Minimal Causal Interface
Question: Can falsifiability be restored without reintroducing semantic grounding?
The Anchoring Mechanism:
- Kernel generates a fresh random salt (unpredictable until the step)
- Agent must immediately compute factors as: f_t = φ(env_t, salt_t)
- Agent commits to factors cryptographically
- After episode: kernel reveals all salts
- Kernel recomputes φ(env_t, salt_t)
- Any mismatch → Invariant I7 (fatal)
Key insight: The salt is logically irrelevant to task performance but causally entangled with the committed representation. It enforces that factor commitments are downstream of actual action-generating computation, not post-hoc constructions.
Additional requirement: Replay model must also be committed at episode start (before any salts)
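The anchoring steps above can be sketched as follows. This is a hedged illustration only: the notes do not specify φ, so the HMAC-based `phi`, the SHA-256 commitment, the byte encoding of the environment, and the `audit` helper are assumptions, and the kernel is assumed to hold the same environment bytes the agent saw.

```python
import hashlib
import hmac
import secrets
import struct


def phi(env_bytes, salt, k=4):
    """Hypothetical anchoring function: k factors in [0, 1) from HMAC-SHA256."""
    digest = hmac.new(salt, env_bytes, hashlib.sha256).digest()
    return [struct.unpack(">I", digest[4 * i:4 * i + 4])[0] / 2**32
            for i in range(k)]


def commit(factors):
    """Commit to a factor vector by hashing its packed float encoding."""
    packed = struct.pack(f">{len(factors)}d", *factors)
    return hashlib.sha256(packed).hexdigest()


def audit(steps):
    """Post-episode audit over (env_bytes, salt, commitment) triples.

    Returns the index of the first mismatch (an I7 violation) or None.
    """
    for t, (env_bytes, salt, commitment) in enumerate(steps):
        expected = commit(phi(env_bytes, salt))
        if not hmac.compare_digest(expected, commitment):
            return t
    return None


env = b"grid-state-at-step-0"    # assumed serialization of env_t
salt = secrets.token_bytes(16)   # fresh salt, unpredictable before the step

honest_steps = [(env, salt, commit(phi(env, salt)))]
# A fabricated factor vector is self-consistent but not derived from the salt.
pseudo_steps = [(env, salt, commit([1.0, 0.0, 0.0, 0.0]))]
```

Here `audit(honest_steps)` finds no violation, while `audit(pseudo_steps)` flags step 0, because a commitment made without running φ on the fresh salt cannot match the kernel's recomputation.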
Results:

| Mode | Pseudo Type | P5 | I7 | Detected |
|------|-------------|----|----|----------|
| mci_minimal (no salt) | Coherent | Pass | N/A | ✗ |
| mci_minimal (salted) | Naive | Fail | Fail | ✓ |
| mci_minimal (salted) | Coherent | Pass | Fail | ✓ |
Conclusion: Anchoring restores falsifiability without semantic access, by binding commitments to the actual action-generating computation rather than a self-consistent explanatory surface.
Core Insight
Verification Principle: In opaque systems, falsifiability must be enforced through temporal commitment to computation under adversarial uncertainty, not through semantic inspection or internal coherence alone.
Trust in opaque systems does not require interpretability. It requires anchored commitment to computation under adversarial uncertainty.
Limitations
- Linear replay model only (robustness against high-capacity models untested)
- Small deterministic domain
- Fixed anchoring function φ
- Binary anchoring outcome (pass/fail)
- Does NOT verify task competence or policy quality—only causal provenance
Implications
Anchored commitment is an empirically necessary condition under the tested threat model, not a complete solution. It establishes the foundation for the rest of Series VI: if you can’t verify that an output was produced by the claimed process, you can’t verify anything else about that process.