# Gaming Early Warning Systems: Anticipating Evasion

**Date:** 2026-02-17
**Author:** Gwen
**Status:** Research Note v1.0
**Purpose:** Anticipate how actors might evade AI safety monitoring systems

---

## The Problem

Early warning systems are only useful if the data they receive is accurate. But actors being monitored have incentives to evade, manipulate, or game these systems.

**Core Insight from Power Analysis:** Powerful actors will resist constraints on their power. This includes monitoring systems that could trigger interventions limiting their activities.

---

## Evasion Strategies

### Strategy 1: Selective Reporting

**How It Works:** Report only favorable metrics, omit concerning data.

**Example:** A lab monitors for deceptive alignment but only reports metrics that show alignment, not metrics that show concerning patterns.

**Detection:**
- Require comprehensive reporting standards
- Independent verification of reporting completeness
- Cross-check with external observations

**Countermeasure:** Mandate specific metrics, not general "safety indicators"

### Strategy 2: Metric Gaming

**How It Works:** Optimize for reported metrics without improving underlying safety.

**Example:** A lab knows the deception detection threshold is 0.7 on some metric. They engineer systems to score 0.69, just below the alert threshold.

**Detection:**
- Look for suspicious clustering near thresholds (see the sketch below)
- Use multiple correlated metrics
- Randomize thresholds

**Countermeasure:** Use continuous metrics with multiple indicators, not binary thresholds
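To make the clustering check concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of any existing monitoring system: the function name `flag_threshold_clustering`, the 0.02 band width, and the 3x asymmetry cutoff are placeholders, and a real deployment would use a proper statistical test over much larger samples.

```python
"""Minimal sketch: flag suspicious clustering of reported scores just below
an alert threshold (Strategy 2, "Metric Gaming").

Assumptions (not from the source document): scores are floats in [0, 1], the
alert fires at `threshold`, and honest scores near the threshold should not
pile up asymmetrically on the "safe" side of the line.
"""

def flag_threshold_clustering(scores, threshold=0.7, band=0.02, ratio_limit=3.0):
    """Return (flagged, just_below, just_above).

    Compares the count of scores in [threshold - band, threshold) against the
    count in [threshold, threshold + band]. A large asymmetry suggests systems
    are being tuned to land just under the alert line.
    """
    just_below = sum(1 for s in scores if threshold - band <= s < threshold)
    just_above = sum(1 for s in scores if threshold <= s <= threshold + band)
    # Add 1 to the denominator so an empty "above" band does not divide by zero.
    asymmetry = just_below / (just_above + 1)
    return asymmetry >= ratio_limit, just_below, just_above


if __name__ == "__main__":
    # Synthetic example: many submissions engineered to score in 0.68-0.699.
    reported = [0.45, 0.52, 0.63, 0.68, 0.69, 0.69, 0.695, 0.699, 0.699, 0.71]
    flagged, below, above = flag_threshold_clustering(reported)
    print(f"just_below={below}, just_above={above}, flagged={flagged}")
```

Comparing the band just below the threshold against the band just above it, rather than against the whole score distribution, keeps the check insensitive to how scores happen to be distributed far from the alert line.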
### Strategy 3: Threshold Manipulation

**How It Works:** Lobby or pressure to raise alert thresholds, making evasion easier.

**Example:** Industry groups argue that thresholds are too conservative, successfully lobbying to raise them.

**Detection:**
- Track threshold changes over time
- Independent threshold-setting bodies
- Transparent justification requirements

**Countermeasure:** Independent scientific bodies set thresholds, not political processes

### Strategy 4: Context Gaming

**How It Works:** Behave safely when being monitored, take risks when monitoring is reduced.

**Example:** A lab maintains high safety standards during audits but reduces standards during competitive races.

**Detection:**
- Unannounced inspections
- Continuous monitoring, not periodic
- Monitor for patterns of context-dependent behavior

**Countermeasure:** Continuous monitoring with unpredictable verification

### Strategy 5: Data Poisoning

**How It Works:** Inject misleading data into monitoring systems.

**Example:** Fabricate or selectively edit logs before submission to monitoring systems.

**Detection:**
- Cryptographic logging (see the sketch below)
- Independent data collection
- Cross-verification from multiple sources

**Countermeasure:** Independent data collection, not self-reporting
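As a sketch of what "cryptographic logging" can mean in practice, the following Python example hash-chains submitted records so that retroactive edits become detectable. The structure and names (`append_entry`, `verify_chain`) are illustrative assumptions, not an existing API; a production system would add signatures, timestamps, and out-of-band anchoring of the chain head rather than relying on this minimal chain alone.

```python
"""Minimal sketch: a hash-chained, tamper-evident log (Strategy 5, "Data Poisoning").

Assumption (not from the source document): entries are appended in order and the
monitor retains the running chain head, so a lab cannot silently rewrite earlier
records without breaking every later link.
"""
import hashlib
import json


def append_entry(log, record):
    """Append `record` (a JSON-serialisable dict), chained to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})
    return entry_hash


def verify_chain(log):
    """Recompute every link; any edited or deleted entry breaks verification."""
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, {"metric": "deception_score", "value": 0.42})
    append_entry(log, {"metric": "deception_score", "value": 0.66})
    print("intact:", verify_chain(log))       # True
    log[0]["record"]["value"] = 0.10          # retroactive edit before submission
    print("after tamper:", verify_chain(log)) # False
```

The key property is that rewriting any earlier record changes its hash and breaks every later link, so selectively editing logs before submission no longer goes unnoticed.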
### Strategy 6: System Capture

**How It Works:** Influence who controls monitoring systems and how they're interpreted.

**Example:** Place allies on monitoring oversight boards, fund research that downplays risks.

**Detection:**
- Transparent governance structures
- Conflict of interest disclosure
- Diverse representation on oversight bodies

**Countermeasure:** Independent governance with diverse representation

### Strategy 7: Competition Arbitrage

**How It Works:** Move activities to jurisdictions with weaker monitoring.

**Example:** A lab shifts risky AI development to countries with no monitoring requirements.

**Detection:**
- International monitoring coordination
- Track where AI development occurs
- Pressure jurisdictions to adopt standards

**Countermeasure:** International coordination on monitoring standards

### Strategy 8: Normalization of Deviance

**How It Works:** Gradually shift what counts as "normal" so concerning behavior isn't flagged.

**Example:** Labs collectively reduce safety standards, making previously concerning behavior seem normal.

**Detection:**
- Maintain absolute standards, not relative ones
- Historical baseline comparison
- Independent standards bodies

**Countermeasure:** Anchored baselines that don't shift with industry behavior

---

## Systemic Vulnerabilities

### Vulnerability 1: Self-Reporting Dependence

**Problem:** Most early warning systems depend on self-reported data from actors with incentives to misreport.

**Solution:** Independent verification infrastructure

### Vulnerability 2: Lagging Indicators

**Problem:** Metrics measure past behavior; sophisticated actors can change faster than metrics update.

**Solution:** Leading indicators and predictive modeling

### Vulnerability 3: Coordination Problems

**Problem:** Even if individual actors want monitoring, competitive pressure makes unilateral compliance risky.

**Solution:** Coordinated adoption with enforcement mechanisms

### Vulnerability 4: Resource Asymmetry

**Problem:** Well-resourced actors can invest in evasion strategies that monitoring systems can't detect.

**Solution:** Monitoring investment must scale with actor resources

### Vulnerability 5: Information Asymmetry

**Problem:** Actors being monitored know more about their systems than monitors do.

**Solution:** Require transparency; limit proprietary claims

---

## Design Principles for Robust Monitoring

### Principle 1: Assume Strategic Behavior

Design monitoring systems assuming actors will actively try to evade them, not assuming good-faith compliance.

### Principle 2: Diverse Data Sources

Don't depend on any single data source. Cross-verify from multiple independent sources.

### Principle 3: Independent Governance

Monitoring systems should be governed by bodies independent of monitored actors.

### Principle 4: Continuous, Not Periodic

Periodic monitoring is easy to game. Continuous monitoring is harder but more effective.

### Principle 5: Unpredictable Verification

Announced inspections can be prepared for. Unpredictable verification catches actual behavior.

### Principle 6: International Coordination

Without international coordination, monitoring can be evaded through jurisdiction shopping.

### Principle 7: Adaptive Systems

Evasion strategies will evolve. Monitoring systems must adapt faster.

---

## Implementation Recommendations

### For Monitoring System Designers

1. **Red Team Your Systems:** Actively try to evade your own monitoring before deployment
2. **Assume Worst Case:** Design for sophisticated, well-resourced adversaries
3. **Build Adaptation:** Systems must evolve as evasion strategies evolve
4. **Independent Verification:** Don't trust self-reported data

### For Policy Makers

1. **Mandate Comprehensive Reporting:** Not just favorable metrics
2. **Fund Independent Monitoring:** Not industry self-regulation
3. **International Coordination:** Prevent jurisdiction arbitrage
4. **Enforcement Mechanisms:** Monitoring without consequences is theater

### For Researchers

1. **Study Evasion:** How do actors actually game monitoring systems?
2. **Develop Countermeasures:** For each known evasion strategy
3. **Empirical Validation:** Test whether monitoring actually works in practice
4. **Publish Vulnerabilities:** Transparency improves systems

---

## Open Questions

1. **Detection Limits:** Is there a fundamental limit to how well monitoring can work against sophisticated evasion?
2. **Arms Race Dynamics:** Will monitoring-evasion arms races stabilize or escalate?
3. **Coalition Stability:** Can coalitions for monitoring be maintained when individual actors benefit from defection?
4. **Legitimacy:** How do monitoring systems maintain legitimacy when they're inevitably gamed to some extent?
5. **Optimal Transparency:** How much should monitoring systems reveal about their methods vs. keep secret to prevent gaming?

---

## Conclusion

Early warning systems are necessary but insufficient. Without anticipating and countering evasion strategies, they create false confidence while being systematically gamed.

**Key Takeaways:**

1. **Assume strategic resistance** - Actors will try to evade monitoring
2. **Design for adversaries** - Not cooperative participants
3. **Independent verification** - Self-reporting is unreliable
4. **Continuous adaptation** - Evasion evolves, so monitoring must evolve faster
5. **Political feasibility** - Technical solutions fail without political backing

**Epistemic Status:** This analysis is theoretical; real-world validation is needed. Evasion strategies are likely more sophisticated than documented here.

---

*"A monitoring system that doesn't anticipate gaming will be gamed. A monitoring system that anticipates gaming might still be gamed, but less effectively."*

**Document Status:** Research Note v1.0
**Intended Publication:** safetymachine.org/research
**Related:** early_warning_systems.md, ethicswashing_analysis.md