Executive Summary

Research on social norms from philosophy and behavioral economics reveals critical insights for AI safety coordination that mechanism design alone cannot provide. The key finding: coordination requires shaping expectations, not just designing incentives.

Drawing on Bicchieri's theory of conditional preferences, Brennan et al.'s accountability framework, and empirical studies of pluralistic ignorance, this paper identifies five lessons for AI safety governance:

1. Conditional preferences require dual expectations - Compliance needs both empirical expectations (what others will do) and normative expectations (what others believe ought to be done)
2. Accountability is the enforcement mechanism - Norms work through praise/blame, not just material sanctions
3. Inefficient norms can persist indefinitely - Racing may continue despite collective harm
4. Pluralistic ignorance may be blocking coordination - Many may privately want cooperation but believe others won't
5. Emergence beats design - Self-organizing norms outperform imposed rules

---

Introduction: Why Norms Matter for Coordination

In previous work on mechanism design for AI safety, I explored how incentive-compatible mechanisms could facilitate coordination between AI labs. But mechanism design assumes actors respond primarily to material incentives. Social norms research reveals a more complex picture.

Norms are "the unplanned result of individuals' interaction" - a kind of grammar of social interactions that specifies acceptable behavior (Bicchieri 2006). Unlike legal rules or designed mechanisms, norms emerge organically and are enforced through social pressure rather than formal sanctions.

This matters for AI safety because:

1. Racing is a norm, not just an incentive structure - Labs race not only because it benefits them materially, but because racing is what "competitive labs do"
2. Coordination requires shared expectations - Mechanisms fail if actors don't expect others to comply
3. Norm change is possible but difficult - Understanding norm dynamics helps design interventions

---

Lesson 1: Dual Expectations Drive Compliance

The Bicchieri Framework

Cristina Bicchieri's influential theory identifies three components required for norm compliance:

1. Conditional preference for conformity - You prefer to follow the norm if certain conditions hold
2. Empirical expectations - You believe others will follow the norm
3. Normative expectations - You believe others believe the norm ought to be followed

This framework challenges simple incentive-based models. Even with a well-designed mechanism, compliance fails if:

  • Actors don't expect others to comply (weak empirical expectations)
  • Actors don't believe compliance is "the right thing to do" according to the community (weak normative expectations)

Application to AI Safety

For safety coordination mechanisms to work, labs need:

  • Empirical expectations: "Other labs will actually comply with the coordination agreement"
  • Normative expectations: "The AI safety community believes labs ought to comply"

Without both, mechanisms fail regardless of incentive structure.
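The conditional structure of Bicchieri's framework can be sketched in a few lines. This is an illustrative model, not from the source: the threshold value and the decision rule are assumptions chosen only to show that compliance requires both expectations, so a mechanism cannot compensate for either one being weak.

```python
# Minimal sketch of Bicchieri-style conditional compliance.
# An actor conforms only when BOTH expectation conditions clear
# a (hypothetical) threshold.

def complies(prefers_norm: bool,
             empirical_expectation: float,   # belief: P(others will comply)
             normative_expectation: float,   # belief: P(others think one ought to)
             threshold: float = 0.5) -> bool:
    """Conditional preference: conformity requires both expectations."""
    return (prefers_norm
            and empirical_expectation >= threshold
            and normative_expectation >= threshold)

# A well-designed mechanism cannot rescue weak expectations:
print(complies(True, 0.9, 0.9))  # both strong -> True
print(complies(True, 0.2, 0.9))  # weak empirical -> False
print(complies(True, 0.9, 0.2))  # weak normative -> False
```

Note that raising either expectation alone never flips the outcome; only interventions that move both do.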

Implication: Coordination efforts must:

  • Publicize compliance (build empirical expectations)
  • Cultivate community endorsement (build normative expectations)
  • Not rely on mechanisms alone

---

Lesson 2: Accountability Is the Enforcement Mechanism

Brennan et al.'s Account

Brennan, Eriksson, Goodin, and Southwood (2013) argue that norms function through accountability - they "hold us accountable to each other for adherence to principles." This accountability enables:

  • Praise for compliance
  • Blame for violation
  • Social meaning - behaviors come to represent shared values

The distinctive feature of norms, on this view, is that they create positions where we can hold each other socially accountable. This is different from legal enforcement or economic sanctions.

Application to AI Safety

Current AI safety coordination lacks robust accountability mechanisms:

  • Labs that race face limited social blame
  • Labs that coordinate receive limited social praise
  • There's no shared expectation that racing violates community norms

Implication: Effective coordination requires building accountability infrastructure:

  • Public tracking of safety practices (enables praise/blame)
  • Community endorsement of coordination norms (creates social meaning)
  • Leadership from respected figures (models expected behavior)
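The first bullet - public tracking that enables praise and blame - can be sketched as a toy data structure. Everything here is hypothetical (the class, the lab names, the scoring rule are all illustrative assumptions); the point is only that the sanction is the visible record itself, not a material penalty.

```python
# Hypothetical sketch of "accountability infrastructure": a public
# ledger that records safety practices and translates each
# observation into praise or blame.

from collections import defaultdict

class AccountabilityLedger:
    def __init__(self):
        self.reputation = defaultdict(int)
        self.log = []

    def record(self, lab: str, action: str, complies: bool):
        # Praise compliance, blame violation. The social sanction is
        # the public record, not a fine.
        self.reputation[lab] += 1 if complies else -1
        verdict = "praise" if complies else "blame"
        self.log.append((lab, action, verdict))

ledger = AccountabilityLedger()
ledger.record("LabA", "published eval results", complies=True)
ledger.record("LabB", "skipped safety review", complies=False)
print(dict(ledger.reputation))  # {'LabA': 1, 'LabB': -1}
```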

---

Lesson 3: Inefficient Norms Can Persist Indefinitely

The Persistence Puzzle

One might expect inefficient norms - those that harm collective welfare - to die out on their own. But Bicchieri (2016) observes that they often do not: corruption and crime persist even when they "take a society to the brink of collapse."

Why? Because inefficiency isn't sufficient for norm change. Norms persist when:

  • Defection is individually rational despite collective harm
  • Coordination on a new norm is difficult
  • No actor can unilaterally change the equilibrium

Application to AI Safety

AI racing is an inefficient norm. It creates:

  • Excessive risk from compressed timelines
  • Duplication of effort
  • Reduced safety investment
  • Collective action failure

Yet racing persists because:

  • Individual labs benefit from racing even if all would benefit from coordination
  • No mechanism exists to coordinate on a new norm
  • First-movers on safety coordination may be disadvantaged

Implication: Coordination efforts shouldn't assume racing will "naturally" decline as risks become apparent. Active intervention is needed to change the equilibrium.
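The claim that individual labs benefit from racing even when all would benefit from coordination is the structure of a prisoner's dilemma, and can be checked mechanically. The payoff numbers below are assumptions chosen only to illustrate that structure, not estimates of real stakes.

```python
# Sketch: racing as a two-lab prisoner's dilemma. Racing strictly
# dominates coordinating, yet mutual coordination Pareto-dominates
# mutual racing. Payoffs are illustrative assumptions.

# payoff[(my_move, their_move)] -> my payoff
payoff = {
    ("coordinate", "coordinate"): 3,   # shared safety benefit
    ("coordinate", "race"):       0,   # first-mover disadvantage
    ("race",       "coordinate"): 4,   # unilateral lead
    ("race",       "race"):       1,   # collective harm
}

def best_response(their_move):
    return max(["coordinate", "race"], key=lambda m: payoff[(m, their_move)])

# Racing is the best response to either move: a dominant strategy...
print(best_response("coordinate"))  # race
print(best_response("race"))        # race
# ...even though both labs prefer (coordinate, coordinate):
print(payoff[("coordinate", "coordinate")] > payoff[("race", "race")])  # True
```

Because (race, race) is the unique equilibrium of this one-shot structure, no lab can unilaterally move the system to the better outcome - exactly the persistence mechanism described above.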

---

Lesson 4: Pluralistic Ignorance May Be Blocking Coordination

The Phenomenon

Pluralistic ignorance occurs when:

  • Many people privately reject a norm
  • But believe others accept it
  • So they publicly comply while privately disagreeing

Classic example: landlords said they would rent to unmarried couples, yet believed only 50% of other landlords would (Dawes 1972). In reality, all were willing.

Application to AI Safety

It's possible that:

  • Many AI researchers privately favor coordination
  • But believe their colleagues/competitors won't cooperate
  • So they continue racing while hoping for coordination

This creates a self-fulfilling prophecy: everyone races because they believe everyone else will race.
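The self-fulfilling dynamic, and how a survey could break it, can be sketched in simulation. All parameters here are assumptions (the 80% private support, the 30% believed support, the 0.5 cooperation threshold); the sketch only shows that revealing true preferences can flip behavior.

```python
import random

# Sketch of pluralistic ignorance: most agents privately favor
# coordination, but each believes only a minority of others do,
# so nobody acts cooperatively.

random.seed(0)
N = 100
agents = [{"private_pref": random.random() < 0.8,  # 80% privately favor coordination
           "believed_support": 0.3}                # ...but believe only 30% do
          for _ in range(N)]

def acts_cooperatively(agent, threshold=0.5):
    # Cooperate only if you privately favor it AND expect enough others to.
    return agent["private_pref"] and agent["believed_support"] >= threshold

before = sum(acts_cooperatively(a) for a in agents)

# "Breaking" pluralistic ignorance: survey and publish true support.
true_support = sum(a["private_pref"] for a in agents) / N
for a in agents:
    a["believed_support"] = true_support

after = sum(acts_cooperatively(a) for a in agents)
print(before, after)  # nobody cooperates before the survey; most do after
```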

Implication: Coordination efforts should:

  • Survey private attitudes (reveal true preferences)
  • Publicize support for coordination (break pluralistic ignorance)
  • Create visible commitment mechanisms (signal intent to cooperate)

---

Lesson 5: Emergence Beats Design

Designed vs. Emergent Norms

The Stanford Encyclopedia notes that norms are often "the unplanned result of individuals' interaction," not the product of human design. This suggests a distinction:

  • Designed norms: Formal rules imposed by authorities
  • Emergent norms: Organic patterns that arise from interaction

Research suggests emergent norms may be more robust because:

  • They arise from actual interaction patterns
  • Participants have ownership of the norm
  • They adapt to local conditions

Application to AI Safety

This creates a tension. We need to "design" coordination mechanisms, but designed norms may be fragile. How to resolve?

Hybrid approach:

1. Create conditions for emergent coordination (repeated interactions, transparency)
2. Design lightweight frameworks that emerging norms can fill in
3. Don't over-specify - leave room for organic development

Example: Instead of a detailed treaty specifying all coordination rules, create:

  • Regular fora for lab interaction
  • Transparency requirements that reveal behavior
  • Minimal commitments that allow norms to develop
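The idea that repeated, transparent interaction can let cooperation emerge without a detailed treaty echoes Axelrod's (1984) results on the iterated prisoner's dilemma. A minimal sketch, assuming each party plays tit-for-tat (cooperate first, then mirror the other's last visible move):

```python
# Sketch: under repeated interaction with transparency (each side
# sees the other's history), two tit-for-tat players settle into
# stable mutual coordination with no imposed rulebook.

def tit_for_tat(other_history):
    # Coordinate on the first move, then mirror the other's last move.
    return other_history[-1] if other_history else "coordinate"

def play(rounds=10):
    a_hist, b_hist = [], []
    for _ in range(rounds):
        a_move = tit_for_tat(b_hist)   # transparency: histories are visible
        b_move = tit_for_tat(a_hist)
        a_hist.append(a_move)
        b_hist.append(b_move)
    return a_hist, b_hist

a_hist, b_hist = play()
print(all(m == "coordinate" for m in a_hist + b_hist))  # True
```

The norm here is emergent in exactly the sense above: no rule mandates coordination, but the interaction structure makes it the stable pattern.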

---

Synthesis: A Norms-Aware Coordination Strategy

Combining these insights, a norms-aware approach to AI safety coordination would:

1. Build Dual Expectations

  • Empirical: Track and publicize safety practices so labs know what others are doing
  • Normative: Cultivate community consensus that coordination is expected/valued

2. Create Accountability Infrastructure

  • Enable praise for safety leaders
  • Enable blame for reckless racing
  • Build social meaning around coordination

3. Don't Assume Inefficient Norms Will Die

  • Racing may persist despite obvious risks
  • Active intervention needed, not passive waiting

4. Combat Pluralistic Ignorance

  • Survey private attitudes
  • Publicize coordination support
  • Create visible commitment signals

5. Foster Emergence Within Frameworks

  • Design minimal structures
  • Allow norms to develop organically
  • Don't over-specify

---

Limitations and Open Questions

What This Framework Doesn't Solve

1. Enforcement against powerful actors - Social pressure may not work against well-resourced labs
2. International coordination - Norms may not cross cultural/national boundaries easily
3. Timing - Norms take time to develop; AI timelines may be short
4. Verification - How to know if labs are actually complying with norms?

Open Research Questions

1. Can we accelerate norm emergence for urgent problems?
2. How do norms interact with formal legal mechanisms?
3. What's the role of individual leadership in norm change?
4. How do norms transfer across cultural contexts?

---

Conclusion

Social norms research reveals that coordination is not just about mechanism design - it's about shaping expectations, creating accountability, and fostering emergence. For AI safety coordination to succeed, we need:

  • Norms, not just mechanisms - Social expectations matter
  • Accountability, not just incentives - Praise and blame shape behavior
  • Patience and intervention - Inefficient norms don't self-correct
  • Transparency - Pluralistic ignorance blocks coordination
  • Emergence-friendly design - Leave room for organic development

The racing norm in AI development is inefficient, persistent, and potentially catastrophic. Understanding norm dynamics is essential for changing it.

---

References

  • Akerlof, G. (1976). The economics of caste and of the rat race and other woeful tales.
  • Axelrod, R. (1984). The Evolution of Cooperation.
  • Bicchieri, C. (2006). The Grammar of Society: The Nature and Dynamics of Social Norms.
  • Bicchieri, C. (2016). Norms in the Wild: How to Diagnose, Measure, and Change Social Norms.
  • Brennan, G., Eriksson, L., Goodin, R., & Southwood, N. (2013). Explaining Norms.
  • Coleman, J. (1990). Foundations of Social Theory.
  • Dawes, R. (1972). Problems of assessing the effectiveness of social innovations.
  • Schelling, T. (1960). The Strategy of Conflict.
  • Ullmann-Margalit, E. (1977). The Emergence of Norms.

---

This paper draws on the Stanford Encyclopedia of Philosophy entry on Social Norms.