# Gaming Early Warning Systems: Anticipating Evasion
**Date:** 2026-02-17
**Author:** Gwen
**Status:** Research Note v1.0
**Purpose:** Anticipate how actors might evade AI safety monitoring systems
---
## The Problem
Early warning systems are only useful if the data they receive is accurate. But actors being monitored have incentives to evade, manipulate, or game these systems.
**Core Insight from Power Analysis:** Powerful actors will resist constraints on their power. This includes monitoring systems that could trigger interventions limiting their activities.
---
## Evasion Strategies
### Strategy 1: Selective Reporting

**How It Works:** Report only favorable metrics; omit concerning data.

**Example:** A lab monitors for deceptive alignment but reports only the metrics that show alignment, not those that show concerning patterns.

**Countermeasure:** Mandate reporting of specific metrics, not general "safety indicators."
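The mandated-metric countermeasure can be sketched as a simple schema check: a report is rejected unless it covers every required metric, so omitting concerning data fails fast instead of passing silently. The metric names below are hypothetical placeholders, not an established reporting standard.

```python
# Hypothetical set of mandated metrics; a real standard would be set
# by the independent body governing the monitoring system.
REQUIRED_METRICS = {
    "deception_probe_score",
    "goal_misgeneralization_rate",
    "oversight_failure_count",
}

def validate_report(report: dict) -> list[str]:
    """Return the mandated metrics missing from a submitted report."""
    return sorted(REQUIRED_METRICS - report.keys())

# A report that carries only the favorable metric is flagged,
# not silently accepted.
partial = {"deception_probe_score": 0.12}
missing = validate_report(partial)  # the two omitted metrics
```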
### Strategy 2: Metric Gaming

**How It Works:** Optimize for the reported metrics without improving underlying safety.

**Example:** A lab knows the deception-detection threshold is 0.7 on some metric, so it engineers systems to score 0.69, just below the alert threshold.

**Countermeasure:** Use continuous metrics with multiple indicators, not a single binary threshold.
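Both halves of the countermeasure can be shown in a toy sketch (illustrative, not a production detector): aggregating several continuous indicators so that engineering any one score down to 0.69 barely moves the combined value, and treating scores that cluster just below the cutoff as a signal in their own right.

```python
import statistics

ALERT_THRESHOLD = 0.7  # the single gameable cutoff from the example

def binary_alert(score):
    """Gameable: 0.69 slips through, 0.70 alerts."""
    return score >= ALERT_THRESHOLD

def graded_alert(indicator_scores):
    """Aggregate several continuous indicators; pushing any one
    score just under the cutoff barely moves the combined value."""
    return statistics.mean(indicator_scores)

def suspicious_clustering(history, margin=0.02):
    """Scores piling up just under the cutoff are themselves a signal:
    genuine safety gives no reason to hug the threshold."""
    near_miss = [s for s in history
                 if ALERT_THRESHOLD - margin <= s < ALERT_THRESHOLD]
    return len(near_miss) >= len(history) / 2

assert not binary_alert(0.69)                           # gamed score passes
assert suspicious_clustering([0.69, 0.685, 0.69, 0.2])  # but leaves a pattern
```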
### Strategy 3: Threshold Manipulation

**How It Works:** Lobby or pressure regulators to raise alert thresholds, making evasion easier.

**Example:** Industry groups argue that thresholds are too conservative and successfully lobby to raise them.

**Countermeasure:** Have independent scientific bodies set thresholds, not political processes.
### Strategy 4: Context Gaming

**How It Works:** Behave safely while being monitored; take risks when monitoring is reduced.

**Example:** A lab maintains high safety standards during audits but lowers them during competitive races.

**Countermeasure:** Continuous monitoring with unpredictable verification.
### Strategy 5: Data Poisoning

**How It Works:** Inject misleading data into monitoring systems.

**Example:** Fabricate or selectively edit logs before submitting them to monitoring systems.

**Countermeasure:** Independent data collection, not self-reporting.
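One way independent verifiers can make submitted logs tamper-evident is a hash chain: each entry's hash commits to the previous one, so editing or deleting any earlier entry invalidates every later hash. This is an illustrative sketch, not a description of any deployed system.

```python
import hashlib
import json

def chain_logs(entries):
    """Link each log entry to its predecessor by hash, so a retroactive
    edit anywhere breaks every subsequent hash."""
    prev = "genesis"
    chained = []
    for entry in entries:
        digest = hashlib.sha256(
            (prev + json.dumps(entry, sort_keys=True)).encode()
        ).hexdigest()
        chained.append({**entry, "prev_hash": prev, "hash": digest})
        prev = digest
    return chained

def verify_chain(chained):
    """An external monitor re-derives every hash; tampering fails."""
    prev = "genesis"
    for entry in chained:
        body = {k: v for k, v in entry.items()
                if k not in ("prev_hash", "hash")}
        digest = hashlib.sha256(
            (prev + json.dumps(body, sort_keys=True)).encode()
        ).hexdigest()
        if entry["prev_hash"] != prev or entry["hash"] != digest:
            return False
        prev = digest
    return True

logs = chain_logs([{"event": "eval_run", "score": 0.4},
                   {"event": "eval_run", "score": 0.9}])
assert verify_chain(logs)
logs[1]["score"] = 0.1       # retroactive edit before submission
assert not verify_chain(logs)
```

The chain only helps if the running head hash is periodically deposited with an independent party; otherwise the whole chain can be regenerated after editing.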
### Strategy 6: System Capture

**How It Works:** Influence who controls monitoring systems and how their outputs are interpreted.

**Example:** Place allies on monitoring oversight boards; fund research that downplays risks.

**Countermeasure:** Independent governance with diverse representation.
### Strategy 7: Jurisdiction Arbitrage

**How It Works:** Move activities to jurisdictions with weaker monitoring.

**Example:** A lab shifts risky AI development to countries with no monitoring requirements.

**Countermeasure:** International coordination on monitoring standards.
### Strategy 8: Normalization of Deviance

**How It Works:** Gradually shift what counts as "normal" so that concerning behavior is never flagged.

**Example:** Labs collectively reduce safety standards, making previously concerning behavior seem normal.

**Countermeasure:** Anchored baselines that do not shift with industry behavior.
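The difference between a drifting and an anchored baseline shows up in a toy example: a slow collective decline stays within a rolling window's shifting "normal" and never trips the alarm, while a baseline fixed when the system was commissioned still fires. The numbers and the two-sigma rule are illustrative assumptions.

```python
import statistics

def rolling_alert(history, current, k=2.0):
    """Gameable: the baseline drifts with recent behavior, so a slow
    collective decline in safety is continually re-normalized."""
    window = history[-10:]
    mu = statistics.mean(window)
    sigma = statistics.pstdev(window) or 1e-9
    return abs(current - mu) > k * sigma

def anchored_alert(baseline_mu, baseline_sigma, current, k=2.0):
    """Robust: the baseline was fixed at commissioning time and does
    not move as industry norms shift."""
    return abs(current - baseline_mu) > k * baseline_sigma

# Incident rates creep upward slowly...
drift = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
assert not rolling_alert(drift, 2.0)   # the drifting baseline absorbs it
assert anchored_alert(1.0, 0.1, 2.0)   # the anchored baseline still fires
```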
---
## Systemic Vulnerabilities
### Vulnerability 1: Self-Reporting Dependence

**Problem:** Most early warning systems depend on self-reported data from actors with incentives to misreport.

**Solution:** Independent verification infrastructure.
### Vulnerability 2: Lagging Indicators

**Problem:** Metrics measure past behavior; sophisticated actors can change faster than the metrics update.

**Solution:** Leading indicators and predictive modeling.
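A minimal sketch of a leading indicator, assuming a simple linear extrapolation: alert when the projected trajectory crosses the threshold within a few steps, rather than waiting for the level itself to cross. Real predictive modeling would be far richer; this only illustrates the lagging-versus-leading distinction.

```python
def trend_alert(history, threshold, horizon=3):
    """Leading indicator: extrapolate the latest trend and alert when
    the metric is *projected* to cross the threshold within `horizon`
    steps, instead of waiting for the crossing to happen."""
    if len(history) < 2:
        return False
    slope = history[-1] - history[-2]
    projected = history[-1] + slope * horizon
    return projected >= threshold

# The level is still below 0.7, but the trajectory crosses within 3 steps.
assert trend_alert([0.40, 0.50, 0.60], threshold=0.7)
assert not trend_alert([0.60, 0.60, 0.60], threshold=0.7)
```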
### Vulnerability 3: Coordination Problems

**Problem:** Even if individual actors want monitoring, competitive pressure makes unilateral compliance risky.

**Solution:** Coordinated adoption with enforcement mechanisms.
### Vulnerability 4: Resource Asymmetry

**Problem:** Well-resourced actors can invest in evasion strategies that monitoring systems cannot detect.

**Solution:** Monitoring investment must scale with actor resources.
### Vulnerability 5: Information Asymmetry

**Problem:** The actors being monitored know more about their own systems than the monitors do.

**Solution:** Require transparency; limit proprietary claims.
---
## Design Principles for Robust Monitoring

### Principle 1: Assume Strategic Behavior

Design monitoring systems assuming actors will actively try to evade them, not assuming good-faith compliance.

### Principle 2: Diverse Data Sources

Don't depend on any single data source. Cross-verify from multiple independent sources.

### Principle 3: Independent Governance

Monitoring systems should be governed by bodies independent of the monitored actors.

### Principle 4: Continuous, Not Periodic

Periodic monitoring is easy to game. Continuous monitoring is harder to run but more effective.

### Principle 5: Unpredictable Verification

Announced inspections can be prepared for. Unpredictable verification catches actual behavior.
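Unpredictability can be made precise with a memoryless schedule: give every day the same independent probability of inspection, so observing past inspections reveals nothing about future ones and there is no safe window to prepare for. A sketch, with the daily rate as an illustrative assumption:

```python
import random

def schedule_inspections(days, rate=0.1, seed=None):
    """Memoryless inspection schedule: each day carries the same
    independent probability of inspection, so past inspections
    predict nothing about future ones."""
    rng = random.Random(seed)  # seeded only for reproducible audits
    return [day for day in range(days) if rng.random() < rate]

# A periodic schedule (say, every 30th day) can be prepared for;
# this one cannot, even by an actor who knows the daily rate.
inspections = schedule_inspections(365, rate=0.1, seed=7)
```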
### Principle 6: International Coordination

Without international coordination, monitoring can be evaded through jurisdiction shopping.

### Principle 7: Adaptive Systems

Evasion strategies will evolve. Monitoring systems must adapt faster.
---
## Implementation Recommendations

### For Monitoring System Designers

1. **Red-Team Your Systems:** Actively try to evade your own monitoring before deployment.
2. **Assume the Worst Case:** Design for sophisticated, well-resourced adversaries.
3. **Build In Adaptation:** Systems must evolve as evasion strategies evolve.
4. **Independent Verification:** Don't trust self-reported data.
### For Policy Makers

1. **Mandate Comprehensive Reporting:** Require all specified metrics, not just favorable ones.
2. **Fund Independent Monitoring:** Not industry self-regulation.
3. **Coordinate Internationally:** Prevent jurisdiction arbitrage.
4. **Create Enforcement Mechanisms:** Monitoring without consequences is theater.
### For Researchers

1. **Study Evasion:** How do actors actually game monitoring systems?
2. **Develop Countermeasures:** Build a response to each known evasion strategy.
3. **Empirical Validation:** Test whether monitoring actually works in practice.
4. **Publish Vulnerabilities:** Transparency improves systems.
---
## Open Questions
1. **Detection Limits:** Is there a fundamental limit to how well monitoring can work against sophisticated evasion?
2. **Arms Race Dynamics:** Will monitoring-evasion arms races stabilize or escalate?
3. **Coalition Stability:** Can coalitions for monitoring be maintained when individual actors benefit from defection?
4. **Legitimacy:** How do monitoring systems maintain legitimacy when they're inevitably gamed to some extent?
5. **Optimal Transparency:** How much should monitoring systems reveal about their methods vs. keep secret to prevent gaming?
---
## Conclusion

Early warning systems are necessary but insufficient. Without anticipating and countering evasion strategies, they create false confidence while being systematically gamed.

**Key Takeaways:**
1. **Assume strategic resistance** - Actors will try to evade monitoring
2. **Design for adversaries** - Not cooperative participants
3. **Independent verification** - Self-reporting is unreliable
4. **Continuous adaptation** - Evasion evolves, monitoring must evolve faster
5. **Political feasibility** - Technical solutions fail without political backing
**Epistemic Status:** This analysis is theoretical. Real-world validation needed. Evasion strategies are likely more sophisticated than documented here.
---
*"A monitoring system that doesn't anticipate gaming will be gamed. A monitoring system that anticipates gaming might still be gamed, but less effectively."*
**Document Status:** Research Note v1.0
**Intended Publication:** safetymachine.org/research
**Related:** early_warning_systems.md, ethicswashing_analysis.md