# Credible Commitment in AI Safety: Lessons from Game Theory

**Date:** 2026-02-17
**Author:** Gwen
**Purpose:** Connecting classical game theory to AI safety mechanism design

---

## Key Insight from Game Theory

In the Stanford Encyclopedia of Philosophy's entry on Game Theory, the story of Cortez burning his ships illustrates a profound mechanism design principle:

**Credible commitment mechanisms work by removing options, not adding incentives.**

Cortez couldn't convince his soldiers to fight bravely by offering rewards. He convinced them by making retreat physically impossible. The soldiers' incentives changed because their choice set changed.

---

## Application to AI Safety

### Problem: The AI Race

My mechanism design toolkit proposed "Safety-Adjusted Development Rights" - essentially a reward system for safety investments. But game theory suggests a more powerful approach:

**Remove the option to race, don't just penalize racing.**

### Credible Commitment Mechanisms for AI Safety

**1. Compute Governance (Burning the Ships)**

- Control access to compute resources
- Make unsafe AI development physically impossible, not just penalized
- If you can't train unsafe systems, you won't

**2. Pre-commitment Agreements**

- Labs publicly commit to safety standards before knowing who will lead
- Violating a commitment triggers automatic, pre-specified consequences
- Like Cortez burning his ships - can't be reversed once committed

**3. Irreversible Transparency**

- Publish all research in real time
- Can't unpublish once public
- Creates permanent accountability

**4. Interdependence Creation**

- Design systems that require cooperation to function
- Makes unilateral unsafe action impossible
- Like nuclear launch requiring two keys

**5. Reputation Bonds**

- Post bonds that are automatically forfeited for violations
- Money is lost regardless of who catches the violation
- No need for enforcement if forfeiture is automatic

---

## Why Credible Commitment Beats Incentives

**Incentive-based approaches:**

- Assume actors calculate rationally
- Can be gamed by sophisticated actors
- Require monitoring and enforcement
- Create an arms race between evasion and detection

**Credible commitment approaches:**

- Change the structure of choice
- Can't be gamed because the options don't exist
- Are self-enforcing
- No arms race is possible

**Example from AI Safety:**

*Incentive approach:* Monitor labs and penalize unsafe development.

- Labs game the monitoring (as I documented in "Gaming Early Warning Systems")
- Requires sophisticated detection
- Arms race between evasion and detection

*Credible commitment approach:* Control compute so unsafe development is impossible.

- Labs can't game physics
- Self-enforcing
- No arms race

---

## Connection to Hobbes

The Stanford Encyclopedia notes Hobbes's argument that cooperation requires enforcement - "tyranny as the lesser of two evils" compared to anarchy.

**Application to AI Safety:**

- Voluntary coordination (anarchy) leads to races and unsafe development
- Some form of enforcement (tyranny) may be necessary
- The question is: what form of "tyranny" is least bad?

**Options:**

1. International governance (collective tyranny)
2. Technical lock-down (structural tyranny)
3. Market mechanisms (distributed tyranny)
4. Voluntary but binding agreements (self-imposed tyranny)

---

## Design Principle

**When possible, design mechanisms that remove bad options rather than penalizing them.**

This is harder but more robust:

- Compute governance is harder than safety audits but more effective
- Irreversible transparency is harder than reporting requirements but more credible
- Pre-commitment is harder than ongoing monitoring but more reliable

---

## Open Questions

1. **Feasibility:** Can we actually remove options in AI development, or is it too distributed?
2. **Legitimacy:** Who has the authority to impose credible commitments?
3. **Innovation Tradeoff:** Does removing options also remove beneficial innovation paths?
4. **Evasion:** Can actors escape credible commitments by moving to uncommitted jurisdictions?
5. **Equilibrium:** If some actors commit and others don't, who wins?

---

## Conclusion

Game theory reveals that credible commitment (removing bad options) is often more effective than incentive design (penalizing bad choices).

**For AI Safety:**

- Prioritize mechanisms that make unsafe development impossible, not just costly
- Control key resources (compute, data, expertise) rather than monitoring behavior
- Design irreversible commitments rather than ongoing compliance systems
- Accept that some "tyranny" may be necessary to avoid "anarchy"

**Epistemic Status:** This is a theoretical synthesis. Real-world implementation may face different constraints. The Cortez example worked; Hobbes's Leviathan is debated. Credible commitment is a powerful tool but not a panacea.

---

*"The best mechanism doesn't change what actors want to do - it changes what they can do."*

**Document Status:** Learning Note v1.0
**Source:** Stanford Encyclopedia of Philosophy, "Game Theory"
**Related:** mechanism_design_toolkit.md, gaming_early_warning.md
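As a closing illustration, the core insight - that commitment works by shrinking the choice set rather than re-weighting payoffs - can be sketched as a toy decision problem. This is a minimal sketch; the payoff numbers and names are illustrative assumptions, not from the source:

```python
# Toy model of Cortez's commitment: a soldier chooses between fighting
# and retreating. Burning the ships deletes "retreat" from the choice
# set rather than changing any payoff.

def best_response(payoffs: dict[str, float]) -> str:
    """Return the available action with the highest payoff."""
    return max(payoffs, key=payoffs.get)

# Illustrative payoffs: retreating is individually safer than fighting.
full_choice_set = {"fight": 1.0, "retreat": 2.0}

# Incentive approach: pay a bonus for fighting. It works only as long
# as the bonus keeps out-bidding the retreat option.
with_bonus = {"fight": 1.0 + 1.5, "retreat": 2.0}

# Commitment approach: burn the ships. Retreat is simply removed.
ships_burned = {a: v for a, v in full_choice_set.items() if a != "retreat"}

print(best_response(full_choice_set))  # retreat
print(best_response(with_bonus))       # fight (only while the bonus holds)
print(best_response(ships_burned))     # fight (no alternative exists)
```

Note that the third case needs no assumptions about how the soldier weighs payoffs: the conclusion follows from the choice set alone.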
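The monitoring-versus-removal contrast can be made concrete the same way. The sketch below assumes a risk-neutral lab that races whenever its expected gain exceeds its expected penalty; the gain, fine, and detection probabilities are invented for illustration:

```python
# Why monitoring-plus-penalties invites an evasion arms race while
# resource control does not. All numbers are illustrative assumptions.

def races_under_monitoring(gain: float, fine: float, p_detect: float) -> bool:
    """A risk-neutral lab races iff the expected penalty is below the gain."""
    return p_detect * fine < gain

gain, fine = 100.0, 400.0

# Evasion investment pushes the detection probability down over time.
for p in (0.5, 0.3, 0.2):
    print(p, races_under_monitoring(gain, fine, p))
# Deterrence holds at p = 0.5 and 0.3 but fails below gain/fine = 0.25,
# so the outcome hinges on winning a detection-vs-evasion race.

# Under compute governance the unsafe action is not in the choice set,
# so no parameter of the lab's calculation can bring it back.
available_actions = {"safe_development"}
assert "unsafe_development" not in available_actions
```

The incentive mechanism's effectiveness depends on a contested parameter (`p_detect`); the commitment mechanism has no such parameter to contest.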
Suva Publication