Summary

Argues that “aligning values” is a category error: values don’t align systems; structural constraints do. Values are agent-relative, non-composable, and subject to drift. Attempting to encode values in systems requires: (1) resolving incommensurable differences, (2) stabilizing against drift, (3) interpreting novel situations semantically. Each requirement introduces stasis-prone complexity. Alternative: encode structural constraints on what systems may do (capability boundaries) rather than what they should value (moral preferences). This doesn’t solve alignment but repositions it: instead of getting AI to “want” the right things, structurally prevent it from doing the wrong things. Values guide human choices within boundaries; boundaries prevent capability-based harm regardless of values. Makes explicit: value-alignment as typically conceived either fails (allows drift/gaming) or produces stasis (interpretation burden). The structural approach accepts narrower safety guarantees in exchange for tractability.

Key Concepts

  • Values don’t compose – Agent-relative, incommensurable, drifting; cannot be encoded stably
  • Category error – Values are for choosing within constraints, not for aligning systems
  • Structural vs value constraints – What-may-do vs what-should-want
  • Tractability tradeoff – Narrower guarantees for implementation feasibility
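
The what-may-do versus what-should-want distinction can be illustrated with a minimal sketch. Everything here is hypothetical and not from the source: the `Boundary` class, the `CapabilityError` exception, and the action names are assumptions chosen for illustration. The sketch only shows that a structural check is a membership test against a fixed set, with no interpretation of intent or value.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


class CapabilityError(PermissionError):
    """Raised when an action falls outside the permitted capability set."""


@dataclass(frozen=True)
class Boundary:
    # Hypothetical: the fixed set of action names the system may perform.
    allowed: FrozenSet[str]

    def run(self, action: str, handler: Callable[[], str]) -> str:
        # Structural check: set membership only. Nothing here asks whether
        # the action is "good"; it asks whether the action is permitted.
        if action not in self.allowed:
            raise CapabilityError(f"action not permitted: {action}")
        return handler()


# Illustrative boundary; the action names are assumptions.
boundary = Boundary(allowed=frozenset({"read_file", "summarize"}))


def attempt(action: str) -> str:
    """Try an action and report whether the boundary blocked it."""
    try:
        return boundary.run(action, lambda: f"did {action}")
    except CapabilityError:
        return "blocked"
```

In this sketch the guarantee is narrow: any harm reachable through a permitted action is still possible. That is the tractability tradeoff named above, since the check needs no semantic interpretation and cannot drift.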

Tags

Cross-References

Open Questions

  • Can structural constraints cover enough to be safe without value-guidance?
  • Is there a hybrid approach combining tractable structure with bounded value-reference?