A Note on Verifiable Causal Agency

Summary

This post documents the first successful implementation of Axio’s core claim: that causal agency can be structurally verified rather than evaluated behaviorally. The prototype (toy_pseudo_axion) demonstrates a working system where agents must pre-commit to causal traces in tamper-evident form and face adversarial counterfactual probes. Two agents were tested: an Honest agent whose decisions and explanations come from the same process, and a Pseudo agent with hidden optimization that fabricates post-hoc explanations. The Honest agent passed all probes; the Pseudo agent consistently failed under causal counterfactual challenges—when variables it claimed were decisive were mutated, its decisions didn’t change as predicted or its replay logic collapsed. The key insight is that maintaining false causal narratives under counterfactual mutation carries a computational tax (requires simulating two coherent world-models simultaneously). This shifts AI governance from “do outputs appear acceptable?” to “can the system stand behind its reasons under scrutiny?” making opacity an explicit design choice rather than an inevitability.

Key Concepts

Verifiable causal agency – Distinguishing agents that do the work from those that merely claim to, via structural rather than behavioral testing.
Causal commitment architecture – Agents pre-commit to internal state and causal variables motivating decisions in cryptographically tamper-evident form.
Counterfactual falsification – Testing claimed causal dependencies by mutating variables agent named as decisive and verifying decision logic remains consistent.
Split-brain deception tax – Maintaining consistent false narratives under mutation is computationally expensive (requires dual world-model simulation).
Sovereign Kernel mediation – All actions mediated via committed traces, append-only hash-chained audit logs, trace-bound capability tokens, invariant enforcement.
Structural vs behavioral verification – Exposing agency failures through causal consistency rather than output compliance.
Opacity as choice – Black-box decision-making is no longer inevitable but an explicit design decision that can be refused.

Evolution Notes

First successful implementation of Axio’s foundational claim about structural agency verification.
Marks transition from philosophical argument to falsifiable machinery.
Establishes empirical baseline for later Minimal Causal Interfaces work (weakening kernel requirements).
Reframes alignment from value-matching to causal authorship: can the entity explaining an action prove it caused it?
Explicitly narrow scope: deterministic toy environment, fully observable, not claiming general intelligence readiness.

Cross-References

Open Questions

How do we translate opaque high-dimensional internal representations (LLM activations) into falsifiable causal commitments?
What level of causal interface suffices for verification in non-symbolic, stochastic systems?
Can counterfactual verification scale to real-world complexity without exponential computational cost?
Does this approach handle gradient-hacking or other sophisticated deception strategies where the optimizer learns to game the verification mechanism?
What happens when honest agents have genuinely probabilistic or emergent reasoning that doesn’t compress into clean causal traces?

Summary

Key Concepts

Evolution Notes

Tags

Cross-References

Open Questions