Why Values Don't Align Systems
Summary
Argues that “aligning values” is a category error: values don’t align systems; structural constraints do. Values are agent-relative, non-composable, and subject to drift. Encoding them in a system requires (1) resolving incommensurable differences, (2) stabilizing the encoding against drift, and (3) semantically interpreting novel situations. Each requirement introduces stasis-prone complexity. The alternative is to encode structural constraints on what a system may do (capability boundaries) rather than on what it should value (moral preferences). This doesn’t solve alignment but repositions it: instead of getting an AI to “want” the right things, structurally prevent it from doing the wrong things. Values guide human choices within the boundaries; the boundaries prevent capability-based harm regardless of values. The argument makes one consequence explicit: value alignment as typically conceived either fails (it allows drift or gaming) or produces stasis (through its interpretation burden). The structural approach accepts narrower safety guarantees in exchange for tractability.
Key Concepts
- Values don’t compose – Agent-relative, incommensurable, drifting; cannot be encoded stably
- Category error – Values are for choosing within constraints, not aligning systems
- Structural vs value constraints – What-may-do vs what-should-want
- Tractability tradeoff – Narrower guarantees for implementation feasibility
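The what-may-do versus what-should-want contrast can be sketched in code. The Python below is a minimal illustration under stated assumptions, not an implementation taken from the source: `Action`, `CapabilityBoundary`, and `execute` are hypothetical names, and the capability strings are invented for the example. The point it tries to show is that the check is a membership test on a fixed set, with no scoring or interpretation of whether an action is "good".

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Action:
    """A requested operation (hypothetical shape, for illustration only)."""
    capability: str  # e.g. "read_file", "send_email"
    target: str


class CapabilityBoundary:
    """Structural constraint: a fixed allowlist of capabilities.

    It says what the system MAY do. It holds no notion of what the
    system should want, so there is nothing to drift or reinterpret.
    """

    def __init__(self, allowed: Iterable[str]) -> None:
        self._allowed = frozenset(allowed)

    def permits(self, action: Action) -> bool:
        # Plain set membership: no semantic judgment about the action.
        return action.capability in self._allowed


def execute(action: Action,
            boundary: CapabilityBoundary,
            handler: Callable[[Action], object]) -> object:
    """Run the handler only if the action is inside the boundary."""
    if not boundary.permits(action):
        raise PermissionError(f"capability not permitted: {action.capability}")
    return handler(action)
```

Under this sketch, a caller with `CapabilityBoundary({"read_file"})` can run a `read_file` action, while a `send_email` action raises `PermissionError` regardless of why it was requested. That mirrors the tradeoff above: the guarantee is narrow (only listed capabilities are reachable), but it is cheap to check and does not depend on interpreting intent.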
Tags
Cross-References
Open Questions
- Can structural constraints cover enough to be safe without value-guidance?
- Is there a hybrid approach combining tractable structure with bounded value-reference?