Why Values Don’t Align Systems

Summary

Argues that “aligning values” is a category error: values don’t align systems; structural constraints do. Values are agent-relative, non-composable, and subject to drift. Encoding values in a system requires (1) resolving incommensurable differences, (2) stabilizing the encoding against drift, and (3) interpreting novel situations semantically. Each requirement introduces stasis-prone complexity. Alternative: encode structural constraints on what a system may do (capability boundaries) rather than what it should value (moral preferences). This doesn’t solve alignment but repositions it: instead of getting AI to “want” the right things, structurally prevent it from doing the wrong things. Values guide human choices within boundaries; boundaries prevent capability-based harm regardless of values. Makes explicit that value alignment as typically conceived either fails (it allows drift or gaming) or produces stasis (through its interpretation burden). The structural approach accepts narrower safety guarantees in exchange for tractability.

Key Concepts

Values don’t compose – Agent-relative, incommensurable, drifting; cannot be encoded stably
Category error – Values are for choosing within constraints, not for aligning systems
Structural vs value constraints – What a system may do vs what it should want
Tractability tradeoff – Narrower guarantees for implementation feasibility
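
One way to picture the structural-vs-value distinction is the minimal Python sketch below. It is an illustration under stated assumptions, not an implementation the note prescribes: `Action`, `Boundary`, `CapabilityError`, and `choose` are all hypothetical names, and the capability labels are invented for the example. The boundary is a plain membership check that needs no interpretation of intent; the preference function only ranks options that are already permitted.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable


# All names below are hypothetical and exist only for illustration.
@dataclass(frozen=True)
class Action:
    name: str
    capability: str  # the capability this action would exercise


class CapabilityError(Exception):
    """Raised when no candidate action falls inside the boundary."""


@dataclass(frozen=True)
class Boundary:
    allowed: FrozenSet[str]

    def permits(self, action: Action) -> bool:
        # Structural check: set membership only, no reading of intent or values.
        return action.capability in self.allowed


def choose(
    actions: Iterable[Action],
    boundary: Boundary,
    preference: Callable[[Action], float],
) -> Action:
    """Rank permitted actions by preference; preferences never widen the boundary."""
    permitted = [a for a in actions if boundary.permits(a)]
    if not permitted:
        raise CapabilityError("no candidate action is inside the boundary")
    return max(permitted, key=preference)


# Usage: the preference strongly favors an out-of-bounds action,
# but the boundary filters it out before any ranking happens.
boundary = Boundary(allowed=frozenset({"read_file", "summarize"}))
candidates = [
    Action("summarize report", "summarize"),
    Action("email report to everyone", "send_email"),
    Action("read config", "read_file"),
]
scores = {"send_email": 10.0, "summarize": 2.0, "read_file": 1.0}
chosen = choose(candidates, boundary, lambda a: scores[a.capability])
print(chosen.name)
```

The sketch also shows the tractability tradeoff: the guarantee is narrow (only listed capabilities can be exercised), but checking it requires no semantic interpretation and does not depend on whether the preference function drifts.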

Cross-References

Open Questions

Can structural constraints alone cover enough cases to be safe without value guidance?
Is there a hybrid approach that combines tractable structure with bounded value-reference?
