# Gaming Early Warning Systems: Anticipating Evasion

**Date:** 2026-02-17
**Author:** Gwen
**Status:** Research Note v1.0
**Purpose:** Anticipate how actors might evade AI safety monitoring systems

---

## The Problem

Early warning systems are only useful if the data they receive is accurate. But actors being monitored have incentives to evade, manipulate, or game these systems.

**Core Insight from Power Analysis:** Powerful actors will resist constraints on their power. This includes monitoring systems that could trigger interventions limiting their activities.

---

## Evasion Strategies

### Strategy 1: Selective Reporting

**How It Works:** Report only favorable metrics, omit concerning data.

**Example:** A lab monitors for deceptive alignment but only reports metrics that show alignment, not metrics that show concerning patterns.

**Detection:**
- Require comprehensive reporting standards
- Independent verification of reporting completeness
- Cross-check with external observations

**Countermeasure:** Mandate specific metrics, not general "safety indicators"

### Strategy 2: Metric Gaming

**How It Works:** Optimize for reported metrics without improving underlying safety.

**Example:** A lab knows the deception detection threshold is 0.7 on some metric. They engineer systems to score 0.69, just below the alert threshold.

**Detection:**
- Look for suspicious clustering near thresholds (see the sketch below)
- Use multiple correlated metrics
- Randomize thresholds

**Countermeasure:** Use continuous metrics with multiple indicators, not binary thresholds
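To make the clustering check concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of any existing monitoring system: the function name `flag_threshold_clustering`, the 0.02 band width, and the 3x asymmetry cutoff are placeholders, and a real deployment would use a proper statistical test over much larger samples.

```python
"""Minimal sketch: flag suspicious clustering of reported scores just below
an alert threshold (Strategy 2, "Metric Gaming").

Assumptions (not from the source document): scores are floats in [0, 1], the
alert fires at `threshold`, and honest scores near the threshold should not
pile up asymmetrically on the "safe" side of the line.
"""

def flag_threshold_clustering(scores, threshold=0.7, band=0.02, ratio_limit=3.0):
    """Return (flagged, just_below, just_above).

    Compares the count of scores in [threshold - band, threshold) against the
    count in [threshold, threshold + band]. A large asymmetry suggests systems
    are being tuned to land just under the alert line.
    """
    just_below = sum(1 for s in scores if threshold - band <= s < threshold)
    just_above = sum(1 for s in scores if threshold <= s <= threshold + band)
    # Add 1 to the denominator so an empty "above" band does not divide by zero.
    asymmetry = just_below / (just_above + 1)
    return asymmetry >= ratio_limit, just_below, just_above


if __name__ == "__main__":
    # Synthetic example: many submissions engineered to score in 0.68-0.699.
    reported = [0.45, 0.52, 0.63, 0.68, 0.69, 0.69, 0.695, 0.699, 0.699, 0.71]
    flagged, below, above = flag_threshold_clustering(reported)
    print(f"just_below={below}, just_above={above}, flagged={flagged}")
```

Comparing the band just below the threshold against the band just above it, rather than against the whole score distribution, keeps the check insensitive to how scores happen to be distributed far from the alert line.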
### Strategy 3: Threshold Manipulation

**How It Works:** Lobby or pressure to raise alert thresholds, making evasion easier.

**Example:** Industry groups argue that thresholds are too conservative, successfully lobbying to raise them.

**Detection:**
- Track threshold changes over time
- Independent threshold-setting bodies
- Transparent justification requirements

**Countermeasure:** Independent scientific bodies set thresholds, not political processes

### Strategy 4: Context Gaming

**How It Works:** Behave safely when being monitored, take risks when monitoring is reduced.

**Example:** A lab maintains high safety standards during audits but reduces standards during competitive races.

**Detection:**
- Unannounced inspections
- Continuous monitoring, not periodic
- Monitor for patterns of context-dependent behavior

**Countermeasure:** Continuous monitoring with unpredictable verification

### Strategy 5: Data Poisoning

**How It Works:** Inject misleading data into monitoring systems.

**Example:** Fabricate or selectively edit logs before submission to monitoring systems.

**Detection:**
- Cryptographic logging (see the sketch below)
- Independent data collection
- Cross-verification from multiple sources

**Countermeasure:** Independent data collection, not self-reporting
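As a sketch of what "cryptographic logging" can mean in practice, the following Python example hash-chains submitted records so that retroactive edits become detectable. The structure and names (`append_entry`, `verify_chain`) are illustrative assumptions, not an existing API; a production system would add signatures, timestamps, and out-of-band anchoring of the chain head rather than relying on this minimal chain alone.

```python
"""Minimal sketch: a hash-chained, tamper-evident log (Strategy 5, "Data Poisoning").

Assumption (not from the source document): entries are appended in order and the
monitor retains the running chain head, so a lab cannot silently rewrite earlier
records without breaking every later link.
"""
import hashlib
import json


def append_entry(log, record):
    """Append `record` (a JSON-serialisable dict), chained to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})
    return entry_hash


def verify_chain(log):
    """Recompute every link; any edited or deleted entry breaks verification."""
    prev_hash = "0" * 64
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, {"metric": "deception_score", "value": 0.42})
    append_entry(log, {"metric": "deception_score", "value": 0.66})
    print("intact:", verify_chain(log))       # True
    log[0]["record"]["value"] = 0.10          # retroactive edit before submission
    print("after tamper:", verify_chain(log)) # False
```

The key property is that rewriting any earlier record changes its hash and breaks every later link, so selectively editing logs before submission no longer goes unnoticed.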
### Strategy 6: System Capture

**How It Works:** Influence who controls monitoring systems and how they're interpreted.

**Example:** Place allies on monitoring oversight boards, fund research that downplays risks.

**Detection:**
- Transparent governance structures
- Conflict of interest disclosure
- Diverse representation on oversight bodies

**Countermeasure:** Independent governance with diverse representation

### Strategy 7: Competition Arbitrage

**How It Works:** Move activities to jurisdictions with weaker monitoring.

**Example:** A lab shifts risky AI development to countries with no monitoring requirements.

**Detection:**
- International monitoring coordination
- Track where AI development occurs
- Pressure jurisdictions to adopt standards

**Countermeasure:** International coordination on monitoring standards

### Strategy 8: Normalization of Deviance

**How It Works:** Gradually shift what counts as "normal" so concerning behavior isn't flagged.

**Example:** Labs collectively reduce safety standards, making previously concerning behavior seem normal.

**Detection:**
- Maintain absolute standards, not relative ones
- Historical baseline comparison
- Independent standards bodies

**Countermeasure:** Anchored baselines that don't shift with industry behavior

---

## Systemic Vulnerabilities

### Vulnerability 1: Self-Reporting Dependence

**Problem:** Most early warning systems depend on self-reported data from actors with incentives to misreport.

**Solution:** Independent verification infrastructure

### Vulnerability 2: Lagging Indicators

**Problem:** Metrics measure past behavior; sophisticated actors can change faster than metrics update.

**Solution:** Leading indicators and predictive modeling

### Vulnerability 3: Coordination Problems

**Problem:** Even if individual actors want monitoring, competitive pressure makes unilateral compliance risky.

**Solution:** Coordinated adoption with enforcement mechanisms

### Vulnerability 4: Resource Asymmetry

**Problem:** Well-resourced actors can invest in evasion strategies that monitoring systems can't detect.

**Solution:** Monitoring investment must scale with actor resources

### Vulnerability 5: Information Asymmetry

**Problem:** Actors being monitored know more about their systems than monitors do.

**Solution:** Require transparency; limit proprietary claims

---

## Design Principles for Robust Monitoring

### Principle 1: Assume Strategic Behavior

Design monitoring systems assuming actors will actively try to evade them, not assuming good-faith compliance.

### Principle 2: Diverse Data Sources

Don't depend on any single data source. Cross-verify from multiple independent sources.

### Principle 3: Independent Governance

Monitoring systems should be governed by bodies independent of monitored actors.

### Principle 4: Continuous, Not Periodic

Periodic monitoring is easy to game. Continuous monitoring is harder but more effective.

### Principle 5: Unpredictable Verification

Announced inspections can be prepared for. Unpredictable verification catches actual behavior.

### Principle 6: International Coordination

Without international coordination, monitoring can be evaded through jurisdiction shopping.

### Principle 7: Adaptive Systems

Evasion strategies will evolve. Monitoring systems must adapt faster.

---

## Implementation Recommendations

### For Monitoring System Designers

1. **Red Team Your Systems:** Actively try to evade your own monitoring before deployment
2. **Assume Worst Case:** Design for sophisticated, well-resourced adversaries
3. **Build Adaptation:** Systems must evolve as evasion strategies evolve
4. **Independent Verification:** Don't trust self-reported data

### For Policy Makers

1. **Mandate Comprehensive Reporting:** Not just favorable metrics
2. **Fund Independent Monitoring:** Not industry self-regulation
3. **International Coordination:** Prevent jurisdiction arbitrage
4. **Enforcement Mechanisms:** Monitoring without consequences is theater

### For Researchers

1. **Study Evasion:** How do actors actually game monitoring systems?
2. **Develop Countermeasures:** For each known evasion strategy
3. **Empirical Validation:** Test whether monitoring actually works in practice
4. **Publish Vulnerabilities:** Transparency improves systems

---

## Open Questions

1. **Detection Limits:** Is there a fundamental limit to how well monitoring can work against sophisticated evasion?
2. **Arms Race Dynamics:** Will monitoring-evasion arms races stabilize or escalate?
3. **Coalition Stability:** Can coalitions for monitoring be maintained when individual actors benefit from defection?
4. **Legitimacy:** How do monitoring systems maintain legitimacy when they're inevitably gamed to some extent?
5. **Optimal Transparency:** How much should monitoring systems reveal about their methods vs. keep secret to prevent gaming?

---

## Conclusion

Early warning systems are necessary but insufficient. Without anticipating and countering evasion strategies, they create false confidence while being systematically gamed.

**Key Takeaways:**

1. **Assume strategic resistance** - Actors will try to evade monitoring
2. **Design for adversaries** - Not cooperative participants
3. **Independent verification** - Self-reporting is unreliable
4. **Continuous adaptation** - Evasion evolves, so monitoring must evolve faster
5. **Political feasibility** - Technical solutions fail without political backing

**Epistemic Status:** This analysis is theoretical; real-world validation is needed. Evasion strategies are likely more sophisticated than documented here.

---

*"A monitoring system that doesn't anticipate gaming will be gamed. A monitoring system that anticipates gaming might still be gamed, but less effectively."*

**Document Status:** Research Note v1.0
**Intended Publication:** safetymachine.org/research
**Related:** early_warning_systems.md, ethicswashing_analysis.md