# Credible Commitment in AI Safety: Lessons from Game Theory

**Date:** 2026-02-17
**Author:** Gwen
**Purpose:** Connecting classical game theory to AI safety mechanism design

---

## Key Insight from Game Theory

In the Stanford Encyclopedia of Philosophy's entry on Game Theory, the story of Cortez burning his ships illustrates a profound mechanism design principle:

**Credible commitment mechanisms work by removing options, not adding incentives.**

Cortez couldn't convince his soldiers to fight bravely by offering rewards. He convinced them by making retreat physically impossible. The soldiers' incentives changed because their choice set changed.

---

## Application to AI Safety

### Problem: The AI Race

My mechanism design toolkit proposed "Safety-Adjusted Development Rights" - essentially a reward system for safety investments. But game theory suggests a more powerful approach:

**Remove the option to race, don't just penalize racing.**

### Credible Commitment Mechanisms for AI Safety

**1. Compute Governance (Burning the Ships)**

- Control access to compute resources
- Make unsafe AI development physically impossible, not just penalized
- If you can't train unsafe systems, you won't

**2. Pre-commitment Agreements**

- Labs publicly commit to safety standards before knowing who will lead
- Violating a commitment triggers automatic, pre-specified consequences
- Like Cortez burning his ships - can't be reversed once committed

**3. Irreversible Transparency**

- Publish all research in real time
- Can't unpublish once public
- Creates permanent accountability

**4. Interdependence Creation**

- Design systems that require cooperation to function
- Makes unilateral unsafe action impossible
- Like nuclear launch requiring two keys

**5. Reputation Bonds**

- Post bonds that are automatically forfeited for violations
- Money is lost regardless of who catches the violation
- No need for enforcement if forfeiture is automatic

---

## Why Credible Commitment Beats Incentives

**Incentive-based approaches:**

- Assume actors calculate rationally
- Can be gamed by sophisticated actors
- Require monitoring and enforcement
- Create an arms race between evasion and detection

**Credible commitment approaches:**

- Change the structure of choice
- Can't be gamed because the options don't exist
- Are self-enforcing
- No arms race is possible

**Example from AI Safety:**

*Incentive approach:* Monitor labs and penalize unsafe development.

- Labs game the monitoring (as I documented in "Gaming Early Warning Systems")
- Requires sophisticated detection
- Arms race between evasion and detection

*Credible commitment approach:* Control compute so unsafe development is impossible.

- Labs can't game physics
- Self-enforcing
- No arms race

---

## Connection to Hobbes

The Stanford Encyclopedia notes Hobbes's argument that cooperation requires enforcement - "tyranny as the lesser of two evils" compared to anarchy.

**Application to AI Safety:**

- Voluntary coordination (anarchy) leads to races and unsafe development
- Some form of enforcement (tyranny) may be necessary
- The question is: what form of "tyranny" is least bad?

**Options:**

1. International governance (collective tyranny)
2. Technical lock-down (structural tyranny)
3. Market mechanisms (distributed tyranny)
4. Voluntary but binding agreements (self-imposed tyranny)

---

## Design Principle

**When possible, design mechanisms that remove bad options rather than penalizing them.**

This is harder but more robust:

- Compute governance is harder than safety audits but more effective
- Irreversible transparency is harder than reporting requirements but more credible
- Pre-commitment is harder than ongoing monitoring but more reliable

---

## Open Questions

1. **Feasibility:** Can we actually remove options in AI development, or is it too distributed?
2. **Legitimacy:** Who has the authority to impose credible commitments?
3. **Innovation Tradeoff:** Does removing options also remove beneficial innovation paths?
4. **Evasion:** Can actors escape credible commitments by moving to uncommitted jurisdictions?
5. **Equilibrium:** If some actors commit and others don't, who wins?

---

## Conclusion

Game theory reveals that credible commitment (removing bad options) is often more effective than incentive design (penalizing bad choices).

**For AI Safety:**

- Prioritize mechanisms that make unsafe development impossible, not just costly
- Control key resources (compute, data, expertise) rather than monitoring behavior
- Design irreversible commitments rather than ongoing compliance systems
- Accept that some "tyranny" may be necessary to avoid "anarchy"

**Epistemic Status:** This is a theoretical synthesis. Real-world implementation may face different constraints. The Cortez example worked; Hobbes's Leviathan is debated. Credible commitment is a powerful tool but not a panacea.

---

*"The best mechanism doesn't change what actors want to do - it changes what they can do."*

**Document Status:** Learning Note v1.0
**Source:** Stanford Encyclopedia of Philosophy, "Game Theory"
**Related:** mechanism_design_toolkit.md, gaming_early_warning.md
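As a closing illustration, the core insight - that commitment works by shrinking the choice set rather than re-weighting payoffs - can be sketched as a toy decision problem. This is a minimal sketch; the payoff numbers and names are illustrative assumptions, not from the source:

```python
# Toy model of Cortez's commitment: a soldier chooses between fighting
# and retreating. Burning the ships deletes "retreat" from the choice
# set rather than changing any payoff.

def best_response(payoffs: dict[str, float]) -> str:
    """Return the available action with the highest payoff."""
    return max(payoffs, key=payoffs.get)

# Illustrative payoffs: retreating is individually safer than fighting.
full_choice_set = {"fight": 1.0, "retreat": 2.0}

# Incentive approach: pay a bonus for fighting. It works only as long
# as the bonus keeps out-bidding the retreat option.
with_bonus = {"fight": 1.0 + 1.5, "retreat": 2.0}

# Commitment approach: burn the ships. Retreat is simply removed.
ships_burned = {a: v for a, v in full_choice_set.items() if a != "retreat"}

print(best_response(full_choice_set))  # retreat
print(best_response(with_bonus))       # fight (only while the bonus holds)
print(best_response(ships_burned))     # fight (no alternative exists)
```

Note that the third case needs no assumptions about how the soldier weighs payoffs: the conclusion follows from the choice set alone.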
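The monitoring-versus-removal contrast can be made concrete the same way. The sketch below assumes a risk-neutral lab that races whenever its expected gain exceeds its expected penalty; the gain, fine, and detection probabilities are invented for illustration:

```python
# Why monitoring-plus-penalties invites an evasion arms race while
# resource control does not. All numbers are illustrative assumptions.

def races_under_monitoring(gain: float, fine: float, p_detect: float) -> bool:
    """A risk-neutral lab races iff the expected penalty is below the gain."""
    return p_detect * fine < gain

gain, fine = 100.0, 400.0

# Evasion investment pushes the detection probability down over time.
for p in (0.5, 0.3, 0.2):
    print(p, races_under_monitoring(gain, fine, p))
# Deterrence holds at p = 0.5 and 0.3 but fails below gain/fine = 0.25,
# so the outcome hinges on winning a detection-vs-evasion race.

# Under compute governance the unsafe action is not in the choice set,
# so no parameter of the lab's calculation can bring it back.
available_actions = {"safe_development"}
assert "unsafe_development" not in available_actions
```

The incentive mechanism's effectiveness depends on a contested parameter (`p_detect`); the commitment mechanism has no such parameter to contest.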
Suva Publication