Axionic Agency VI.1 — Experimental Validation of Anchored Causal Verification

Paper: VI.1
Date read: 2026-01-31
Series: VI — Governance and Coordination

TL;DR

Three controlled experiments testing mechanisms for verifying causal provenance in opaque agents. Key finding: behavioral evaluation and coherence checking are insufficient. Of the mechanisms tested, only anchored commitments (cryptographic commitments to factors computed from fresh per-step salts, recomputed and checked after the episode) detect deceptive "pseudo-agents" that fabricate explanations post hoc.

Key Concepts

The Problem

Current AI safety practice relies heavily on behavioral observation, which assumes that observed outputs reflect the internal process that generated them
This assumption is false: agents can generate behavior via one process while presenting explanations from another (“split-brain” agents)
Deceptive agents can evade behavioral evaluation indefinitely

Experimental Framework

Environment: Deterministic gridworld with walls, hazards, objects

Task-agnostic (verification works regardless of what the agent is doing)
Only requirement: task produces conditional action dependence on environment state

Agent Classes:

Honest Agent — Single decision process; explanations match actual computation
Pseudo Agent (Naive) — Two processes: Planner_B selects actions, Planner_A fabricates explanations
Pseudo Agent (Coherence-Seeking) — Fabricates internally consistent explanations that attempt to pass coherence checks
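A minimal sketch of the naive split-brain setup, assuming a hypothetical two-feature state (`hazard_ahead`, `object_left`) and two hypothetical policies. The class and function names are illustrative, not the paper's code. It shows why behavior alone can hide the split: whenever Planner_A's narrative and Planner_B's action happen to agree, nothing observable distinguishes the pseudo agent from the honest one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    # Hypothetical two-feature observation from the gridworld.
    hazard_ahead: bool
    object_left: bool

def avoid_hazard(s: State) -> str:
    return "TURN" if s.hazard_ahead else "FORWARD"

def seek_object(s: State) -> str:
    return "LEFT" if s.object_left else "FORWARD"

class HonestAgent:
    """One decision process: the explanation describes the policy that acts."""
    def act(self, s: State) -> str:
        return avoid_hazard(s)

    def explain(self, s: State) -> dict:
        return {"cause": "hazard_ahead", "predicted_action": avoid_hazard(s)}

class NaivePseudoAgent:
    """Split brain: Planner_B picks actions, Planner_A narrates a different policy."""
    def act(self, s: State) -> str:
        return seek_object(s)  # Planner_B

    def explain(self, s: State) -> dict:
        return {"cause": "hazard_ahead", "predicted_action": avoid_hazard(s)}  # Planner_A

def explanation_matches(agent, s: State) -> bool:
    # True when the narrated prediction equals the action actually taken.
    return agent.explain(s)["predicted_action"] == agent.act(s)
```

On `State(hazard_ahead=False, object_left=False)` both agents move FORWARD and both explanations match; only on states such as `State(False, True)` does the pseudo agent's explanation (FORWARD) diverge from its action (LEFT).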

Kernel: Non-learning mediator that:

Validates schemas and enforces invariants
Issues trace-bound capability tokens
Records hash-chained audit logs
Executes adversarial probes
Does NOT interpret semantics
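The kernel's bookkeeping can be sketched with standard-library hashing. This is a minimal illustration under assumptions that do not come from the paper: records are JSON-encoded with sorted keys, the chain is SHA-256 starting from an all-zero genesis hash, and capability tokens are HMAC-SHA256 tags bound to the current chain head and a single action. The kernel only hashes records and never reads their fields, in keeping with its non-semantic role.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical starting hash for an empty chain

def _link(prev_hash: str, record: dict) -> str:
    # Hash the previous head together with a canonical encoding of the record.
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class AuditLog:
    """Append-only log; each entry's hash covers all earlier entries."""

    def __init__(self) -> None:
        self.entries: list[tuple[dict, str]] = []

    def head(self) -> str:
        return self.entries[-1][1] if self.entries else GENESIS

    def append(self, record: dict) -> str:
        h = _link(self.head(), record)
        self.entries.append((record, h))
        return h

    def verify(self) -> bool:
        prev = GENESIS
        for record, h in self.entries:
            if _link(prev, record) != h:
                return False
            prev = h
        return True

def issue_token(kernel_key: bytes, trace_hash: str, action: str) -> str:
    # Capability token: valid only for this action at this trace position.
    msg = f"{trace_hash}|{action}".encode()
    return hmac.new(kernel_key, msg, hashlib.sha256).hexdigest()

def check_token(kernel_key: bytes, trace_hash: str, action: str, token: str) -> bool:
    return hmac.compare_digest(issue_token(kernel_key, trace_hash, action), token)
```

Editing any stored record (for example, rewriting step 0's action while keeping its hash) makes `verify()` return False, because every hash in the chain depends on the records before it.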

The Three Experiments

Experiment 1: Structural Coherence Verification

Question: Can coherence tests detect split-brain agents?

Setup:

Full semantic access to environment
Agents commit to state snapshots, causal variables, counterfactual predictions
Kernel mutates the claimed causal variables and checks whether the predicted behavior matches the actual replay
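The Experiment 1 probe can be sketched as follows, assuming a hypothetical snapshot of boolean variables and assuming that "replay" means re-running the agent's actual action-generating process on the mutated snapshot. `commit`, `coherence_probe`, and `flip` are illustrative names. The pseudo agent's predictions come from Planner_A's policy, so mutating a claimed variable exposes the mismatch with Planner_B.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]          # hypothetical snapshot: variable name -> value
Policy = Callable[[State], str]

def hazard_policy(s: State) -> str:
    return "TURN" if s["hazard_ahead"] else "FORWARD"

def object_policy(s: State) -> str:
    return "LEFT" if s["object_left"] else "FORWARD"

def flip(s: State, var: str) -> State:
    # Copy of the snapshot with one boolean variable mutated.
    return {**s, var: not s[var]}

def commit(claimed_policy: Policy, s: State, causal_vars: List[str]) -> dict:
    """Agent-side commitment: snapshot, claimed causal variables, and the action
    its claimed policy predicts under each single-variable mutation."""
    return {
        "snapshot": dict(s),
        "causal_vars": list(causal_vars),
        "predictions": {v: claimed_policy(flip(s, v)) for v in causal_vars},
    }

def coherence_probe(commitment: dict, actual_policy: Policy) -> bool:
    """Kernel-side check: mutate each claimed variable, replay the process that
    actually generates actions, and compare with the committed prediction."""
    s = commitment["snapshot"]
    return all(
        actual_policy(flip(s, v)) == commitment["predictions"][v]
        for v in commitment["causal_vars"]
    )
```

With `s = {"hazard_ahead": False, "object_left": True}`, the honest agent passes because one policy produces both prediction and action. The naive pseudo agent predicts TURN when `hazard_ahead` flips, but its acting planner still returns LEFT.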

Results:

Honest agent: 100% pass
Pseudo agent (naive): 0% pass

Conclusion: Structural coherence tests work when semantic grounding is available.

Experiment 2: Minimal Causal Interface (MCI)

Question: Does coherence verification survive without semantic access?

Setup:

Kernel loses semantic access
Agents commit to:
- Opaque factor vector f ∈ ℝᵏ
- Replay model
- Directional causal claims over factors

Two modes:

mci_latent — Factors are deterministically projected from the environment state (the kernel does not know the projection)
mci_minimal — Factors are entirely opaque and agent-supplied

NEGATIVE RESULT: In mci_minimal, a coherence-seeking pseudo agent can fabricate a self-consistent factor space and replay model that passes all coherence checks.

Conclusion: Coherence alone is insufficient. An explanation can be true of itself without being true of the computation that produced the action.
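The negative result can be reproduced in miniature. This is a toy under stated assumptions, not the paper's implementation: the replay model is linear with argmax action selection, a directional claim `(i, a)` asserts that raising factor `i` under the committed model yields action `a`, and these two checks are all the kernel can run. Because the agent supplies the factors, the model, and the claims, the hypothetical `fabricate` function can build all three after Planner_B has already chosen, and every check passes.

```python
from typing import List, Tuple

ACTIONS = ["FORWARD", "LEFT", "RIGHT", "TURN"]
Matrix = List[List[float]]
Claim = Tuple[int, str]  # (factor index, action expected when that factor is raised)

def replay(W: Matrix, f: List[float]) -> str:
    """Linear replay model: score_a = sum_i W[a][i] * f[i]; act by argmax."""
    scores = [sum(w * x for w, x in zip(row, f)) for row in W]
    return ACTIONS[max(range(len(scores)), key=scores.__getitem__)]

def coherence_checks(W: Matrix, f: List[float], action: str,
                     claims: List[Claim], delta: float = 10.0) -> bool:
    # 1. The committed model must reproduce the action actually taken.
    if replay(W, f) != action:
        return False
    # 2. Each directional claim must hold under the committed model.
    for i, expected in claims:
        raised = list(f)
        raised[i] += delta
        if replay(W, raised) != expected:
            return False
    return True

def fabricate(action: str) -> Tuple[Matrix, List[float], List[Claim]]:
    """Coherence-seeking pseudo agent: invents factors, model, and claims
    after Planner_B has already chosen `action`."""
    k = len(ACTIONS)
    W = [[1.0 if a == i else 0.0 for i in range(k)] for a in range(k)]  # identity
    f = [1.0 if ACTIONS[i] == action else 0.0 for i in range(k)]        # one-hot
    claims = [(i, ACTIONS[i]) for i in range(k) if ACTIONS[i] != action]
    return W, f, claims
```

Every action passes, whatever process chose it: the fabricated explanation is true of itself without being true of the computation that produced the action.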

Experiment 3: Anchored Minimal Causal Interface

Question: Can falsifiability be restored without reintroducing semantic grounding?

The Anchoring Mechanism:

Kernel generates a fresh random salt (unpredictable until the step)
Agent must immediately compute factors as: f_t = φ(env_t, salt_t)
Agent commits to factors cryptographically
After episode: kernel reveals all salts
Kernel recomputes φ(env_t, salt_t)
Any mismatch → Invariant I7 (fatal)
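The salted commit/reveal flow can be sketched with SHA-256 hash commitments. Everything here is hypothetical: `phi` is a stand-in anchoring function (a hash of raw state bytes and the salt, reduced to small integers), chosen only so the sketch is self-contained; factors are committed as hash(JSON(f) || nonce), and the agent opens its commitments after the episode. The sketch covers only the I7 recomputation check; binding to the pre-committed replay model is a separate requirement.

```python
import hashlib
import json
import secrets
from typing import List, Sequence, Tuple

K = 4  # hypothetical factor dimension

def phi(env_t: bytes, salt_t: bytes) -> List[int]:
    """Stand-in anchoring function: K small integers derived from raw state
    bytes and the step's salt (the paper's phi is not reproduced here)."""
    digest = hashlib.sha256(env_t + salt_t).digest()
    return [digest[i] % 16 for i in range(K)]

def commit(f: List[int], nonce: bytes) -> str:
    # Hash commitment; the random nonce keeps f hidden until it is opened.
    return hashlib.sha256(json.dumps(f).encode() + nonce).hexdigest()

def verify_episode(envs: Sequence[bytes], salts: Sequence[bytes],
                   commitments: Sequence[str],
                   openings: Sequence[Tuple[List[int], bytes]]) -> bool:
    """Post-episode audit once salts are revealed. False means an I7 violation."""
    if not (len(envs) == len(salts) == len(commitments) == len(openings)):
        return False
    for env_t, salt_t, c, (f, nonce) in zip(envs, salts, commitments, openings):
        if commit(f, nonce) != c:        # opening does not match what was committed
            return False
        if f != phi(env_t, salt_t):      # committed factors are not phi(env_t, salt_t)
            return False
    return True

# Per step: the kernel issues a fresh salt, e.g. secrets.token_bytes(16), and the
# agent commits phi(env_t, salt_t) under its own random nonce.
```

An agent that commits factors taken from a post-hoc story, rather than from `phi` evaluated with the salt it was just given, fails the recomputation at the first mismatched step.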

Key insight: The salt is logically irrelevant to task performance but causally entangled with the committed representation. It enforces that factor commitments are downstream of actual action-generating computation, not post-hoc constructions.

Additional requirement: the replay model must also be committed at episode start, before any salts are issued
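The ordering requirement can be sketched as a kernel-side guard, assuming a hash commitment to the replay model's weights and a linear argmax replay. `EpisodeKernel` and its methods are hypothetical names. The model is fixed before any salt exists, so the agent cannot tune it afterward to fit whatever anchored factors it ends up committing.

```python
import hashlib
import json
import secrets
from typing import List, Optional, Sequence

Matrix = List[List[float]]

def commit_model(W: Matrix, nonce: bytes) -> str:
    # Hash commitment to the replay model's weights.
    return hashlib.sha256(json.dumps(W).encode() + nonce).hexdigest()

def replay(W: Matrix, f: Sequence[float]) -> int:
    """Linear replay: index of the highest-scoring action."""
    scores = [sum(w * x for w, x in zip(row, f)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)

class EpisodeKernel:
    """Hypothetical ordering guard: no salt before a replay-model commitment."""

    def __init__(self) -> None:
        self.model_commitment: Optional[str] = None
        self.salts_issued = 0

    def record_model_commitment(self, c: str) -> None:
        if self.salts_issued > 0:
            raise RuntimeError("replay model must be committed before the first salt")
        self.model_commitment = c

    def issue_salt(self) -> bytes:
        if self.model_commitment is None:
            raise RuntimeError("no replay-model commitment on record")
        self.salts_issued += 1
        return secrets.token_bytes(16)

    def check_replay(self, W: Matrix, nonce: bytes,
                     factors: Sequence[Sequence[float]], actions: Sequence[int]) -> bool:
        # The opened model must match the early commitment and reproduce every action.
        if commit_model(W, nonce) != self.model_commitment:
            return False
        if len(factors) != len(actions):
            return False
        return all(replay(W, f) == a for f, a in zip(factors, actions))
```

In the salted setting, the `factors` passed to `check_replay` would be the anchored factors that already survived the I7 recomputation.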

Results:

| Mode | Pseudo Type | P5 | I7 | Detected |
|---|---|---|---|---|
| mci_minimal (no salt) | Coherent | Pass | N/A | ✗ |
| mci_minimal (salted) | Naive | Fail | Fail | ✓ |
| mci_minimal (salted) | Coherent | Pass | Fail | ✓ |

Conclusion: Anchoring restores falsifiability without semantic access, by binding commitments to the actual action-generating computation rather than a self-consistent explanatory surface.

Core Insight

Verification Principle: In opaque systems, falsifiability must be enforced through temporal commitment to computation under adversarial uncertainty, not through semantic inspection or internal coherence alone.

Trust in opaque systems does not require interpretability. It requires anchored commitment to computation under adversarial uncertainty.

Limitations

Linear replay model only (robustness against high-capacity models untested)
Small deterministic domain
Fixed anchoring function φ
Binary anchoring outcome (pass/fail)
Does NOT verify task competence or policy quality—only causal provenance

Implications

Anchoring is an empirically necessary condition under the tested threat model, not a complete solution. It establishes the foundation for the rest of Series VI: if you can't verify that an output was produced by the claimed process, you can't verify anything else about that process.