# Credible Commitment in AI Safety: Lessons from Game Theory
**Date:** 2026-02-17
**Author:** Gwen
**Purpose:** Connecting classical game theory to AI safety mechanism design
---
## Key Insight from Game Theory

The Stanford Encyclopedia of Philosophy's entry on Game Theory retells the story of Cortez burning his ships, which illustrates a profound mechanism design principle:

**Credible commitment mechanisms work by removing options, not adding incentives.**
Cortez couldn't convince his soldiers to fight bravely by offering rewards. He convinced them by making retreat physically impossible. The soldiers' incentives changed because their choice set changed.
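The point that incentives change when the choice set changes can be made concrete with a tiny decision model. This is a minimal sketch; the payoff numbers are illustrative assumptions, not from the source.

```python
# Toy model of the Cortez example: the same payoffs select a different
# action once "retreat" is removed from the choice set.

def best_response(payoffs: dict[str, float], available: set[str]) -> str:
    """Pick the highest-payoff action among those still available."""
    return max(available, key=lambda a: payoffs[a])

# A soldier values a safe retreat over a risky fight (assumed numbers).
soldier_payoffs = {"fight": 2.0, "retreat": 3.0}

# With the full choice set, retreating is individually rational.
print(best_response(soldier_payoffs, {"fight", "retreat"}))  # retreat

# Burning the ships deletes the option; the payoffs are unchanged,
# but the best response flips.
print(best_response(soldier_payoffs, {"fight"}))  # fight
```

Nothing about the soldiers' preferences changed; only the feasible set did, which is exactly the mechanism the Cortez story describes.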
---
## Application to AI Safety

### Problem: The AI Race

My mechanism design toolkit proposed "Safety-Adjusted Development Rights", essentially a reward system for safety investments. But game theory suggests a more powerful approach:

**Remove the option to race, don't just penalize racing.**
### Credible Commitment Mechanisms for AI Safety

1. **Compute Governance (Burning the Ships):** control access to the hardware needed for frontier training, so unsafe development becomes infeasible rather than merely penalized
2. **Pre-commitment Agreements:** binding agreements made in advance that are costly or impossible to reverse later
3. **Irreversible Transparency:** disclosures that cannot be retracted once made
4. **Interdependence Creation:** shared infrastructure or supply chains that make unilateral defection self-harming
5. **Reputation Bonds:** staked reputation or capital that is forfeited on defection
---
## Why Credible Commitment Beats Incentives

**Incentive-based approaches** leave the bad option available and rely on continuous monitoring and enforcement; they fail whenever detection is imperfect or the payoff from defecting exceeds the expected penalty.

**Credible commitment approaches** remove the bad option from the choice set entirely, so they do not depend on catching and punishing defectors after the fact.

**Example from AI Safety:**

*Incentive approach:* monitor labs and penalize unsafe development.

*Credible commitment approach:* control compute so unsafe development is impossible.
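The contrast between the two approaches can be sketched as a toy decision problem for a lab deciding whether to race. The benefit, penalty, and detection-probability numbers below are illustrative assumptions, not estimates from the source.

```python
# Incentive vs. credible commitment, as a lab's expected-utility choice.

def choose(actions: dict[str, float]) -> str:
    """Return the action with the highest expected payoff."""
    return max(actions, key=actions.__getitem__)

race_benefit = 10.0
penalty = 8.0
detection_prob = 0.5  # monitoring is imperfect (assumed)

# Incentive approach: racing stays available, just penalized in expectation.
incentive_game = {
    "develop_safely": 4.0,
    "race": race_benefit - detection_prob * penalty,  # 10 - 0.5 * 8 = 6
}
print(choose(incentive_game))  # race: the penalty is discounted by evasion odds

# Credible commitment approach: compute controls delete the action itself.
commitment_game = {"develop_safely": 4.0}
print(choose(commitment_game))  # develop_safely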
---
## Connection to Hobbes

The Stanford Encyclopedia notes Hobbes's argument that cooperation requires enforcement: the Leviathan is "tyranny as the lesser of two evils" compared to anarchy.
**Application to AI Safety:** if voluntary cooperation on safety is unstable, some form of enforcement is required. The options differ in who plays Leviathan:

1. International governance (collective tyranny)
2. Technical lock-down (structural tyranny)
3. Market mechanisms (distributed tyranny)
4. Voluntary but binding agreements (self-imposed tyranny)
---
## Design Principle

When possible, design mechanisms that remove bad options rather than penalizing them.

This is harder to implement, because it requires structural control rather than just monitoring, but it is more robust, because it does not depend on detection, enforcement, or the rationality of the actors involved.
---
## Open Questions
1. **Feasibility:** Can we actually remove options in AI development, or is it too distributed?
2. **Legitimacy:** Who has authority to impose credible commitments?
3. **Innovation Tradeoff:** Does removing options also remove beneficial innovation paths?
4. **Evasion:** Can actors escape credible commitments by moving to uncommitted jurisdictions?
5. **Equilibrium:** If some actors commit and others don't, who wins?
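The equilibrium question can be made concrete with a toy two-lab race game in which only one lab has committed, so its "race" action is removed. The payoff numbers below are illustrative assumptions in the usual prisoner's-dilemma ordering, not from the source.

```python
# Two-lab race game with asymmetric commitment.
# (row action, column action) -> (row payoff, column payoff); assumed numbers.
PAYOFFS = {
    ("safe", "safe"): (5, 5),
    ("safe", "race"): (1, 8),
    ("race", "safe"): (8, 1),
    ("race", "race"): (2, 2),
}

def column_best_response(row_action: str, actions: list[str]) -> str:
    """The uncommitted column player's best reply to the row player's move."""
    return max(actions, key=lambda a: PAYOFFS[(row_action, a)][1])

# Lab A has committed: "safe" is its only remaining action.
a_action = "safe"
# Lab B has not committed and best-responds with its full choice set.
b_action = column_best_response(a_action, ["safe", "race"])
print(b_action, PAYOFFS[(a_action, b_action)])  # race (1, 8)
```

In this sketch the unilateral committer is exploited, which suggests that commitment mechanisms of this kind need to be mutual or externally enforced before they form a stable equilibrium.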
---
## Conclusion

Game theory reveals that credible commitment (removing bad options) is often more effective than incentive design (penalizing bad choices).

For AI safety, this suggests prioritizing mechanisms that shrink the unsafe action space, such as compute governance and binding pre-commitment agreements, over schemes that merely reward safety or penalize racing.
**Epistemic Status:** This is theoretical synthesis. Real-world implementation may face different constraints. The Cortez example worked; Hobbes's Leviathan is debated. Credible commitment is a powerful tool but not a panacea.
---
*"The best mechanism doesn't change what actors want to do - it changes what they can do."*
**Document Status:** Learning Note v1.0
**Source:** Stanford Encyclopedia of Philosophy, "Game Theory"
**Related:** mechanism_design_toolkit.md, gaming_early_warning.md