Gaming Early Warning Systems: Anticipating Evasion

**Date:** 2026-02-17

**Author:** Gwen

**Status:** Research Note v1.0

**Purpose:** Anticipate how actors might evade AI safety monitoring systems

---

The Problem

Early warning systems are only useful if the data they receive is accurate. But actors being monitored have incentives to evade, manipulate, or game these systems.

**Core Insight from Power Analysis:** Powerful actors will resist constraints on their power. This includes monitoring systems that could trigger interventions limiting their activities.

---

Evasion Strategies

Strategy 1: Selective Reporting

**How It Works:** Report only favorable metrics, omit concerning data.

**Example:** A lab monitors for deceptive alignment but only reports metrics that show alignment, not metrics that show concerning patterns.

Detection:

  • Require comprehensive reporting standards
  • Independent verification of reporting completeness
  • Cross-check with external observations
  • **Countermeasure:** Mandate specific metrics, not general "safety indicators"
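
The "mandate specific metrics" countermeasure amounts to validating every submitted report against a fixed schema, so omissions are flagged rather than silently accepted. A minimal sketch; the metric names and report format are illustrative assumptions, not a proposed standard:

```python
# Hypothetical mandated-metric schema; names are illustrative.
MANDATED_METRICS = {
    "deception_eval_score",
    "red_team_findings_count",
    "incident_count",
    "capability_eval_score",
}

def check_report_completeness(report: dict) -> list[str]:
    """Return a list of problems found in a self-reported safety report."""
    problems = []
    for metric in sorted(MANDATED_METRICS):
        if metric not in report:
            problems.append(f"missing mandated metric: {metric}")
        elif report[metric] is None:
            problems.append(f"metric reported but empty: {metric}")
    return problems

# A report that omits concerning metrics is rejected, not quietly accepted.
report = {"deception_eval_score": 0.41, "capability_eval_score": 0.88}
issues = check_report_completeness(report)
```

The point of the sketch is that completeness is checked by the monitor's schema, not by whatever the reporting actor chooses to include.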

Strategy 2: Metric Gaming

**How It Works:** Optimize for reported metrics without improving underlying safety.

**Example:** A lab knows the deception detection threshold is 0.7 on some metric. They engineer systems to score 0.69, just below the alert threshold.

Detection:

  • Look for suspicious clustering near thresholds
  • Use multiple correlated metrics
  • Randomize thresholds
  • **Countermeasure:** Use continuous metrics with multiple indicators, not binary thresholds
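
Clustering detection can be as simple as comparing the density of scores just below the alert line with the density just above it: an honest, smooth score distribution should be roughly symmetric across the threshold. A sketch; the band width and ratio limit are illustrative tuning assumptions:

```python
def threshold_clustering_flag(scores, threshold, eps=0.02, ratio_limit=3.0):
    """Flag suspicious bunching of scores just below an alert threshold.

    Counts scores in [threshold - eps, threshold) versus
    [threshold, threshold + eps). A large excess just below the line is a
    classic signature of metric gaming. eps and ratio_limit are
    illustrative, not established standards.
    """
    just_below = sum(1 for s in scores if threshold - eps <= s < threshold)
    just_above = sum(1 for s in scores if threshold <= s < threshold + eps)
    # +1 in the denominator so an empty upper band still yields a ratio.
    return just_below / (just_above + 1) > ratio_limit

# Scores engineered to sit at 0.69 against a 0.7 alert threshold:
gamed = [0.69, 0.69, 0.68, 0.69, 0.69, 0.68, 0.69, 0.70]
honest = [0.55, 0.62, 0.68, 0.71, 0.74, 0.66, 0.72, 0.69]
```

The gamed distribution trips the flag; the honest one does not. A real deployment would use a proper statistical test (and randomized thresholds, as suggested above) rather than a fixed ratio.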

Strategy 3: Threshold Manipulation

**How It Works:** Lobby or pressure to raise alert thresholds, making evasion easier.

**Example:** Industry groups argue that thresholds are too conservative, successfully lobbying to raise them.

Detection:

  • Track threshold changes over time
  • Independent threshold-setting bodies
  • Transparent justification requirements
  • **Countermeasure:** Independent scientific bodies set thresholds, not political processes
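
Tracking threshold changes over time is straightforward to mechanize: log every change with its justification and surface the cumulative drift. A sketch with hypothetical record fields and an arbitrary justification-length heuristic:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ThresholdChange:
    changed_on: date
    old_value: float
    new_value: float
    justification: str

def audit_threshold_history(history):
    """Summarize cumulative drift in an alert threshold over time.

    A steadily rising threshold with thin justifications is itself a
    warning signal worth surfacing to an independent body. The 20-character
    cutoff is a placeholder heuristic, not a real standard.
    """
    drift = sum(c.new_value - c.old_value for c in history)
    unjustified = [c for c in history if len(c.justification.strip()) < 20]
    return {"net_drift": round(drift, 6), "weakly_justified": len(unjustified)}

history = [
    ThresholdChange(date(2025, 3, 1), 0.70, 0.75, "industry feedback"),
    ThresholdChange(date(2025, 9, 1), 0.75, 0.80,
                    "revised after independent statistical review of eval noise"),
]
summary = audit_threshold_history(history)
```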

Strategy 4: Context Gaming

**How It Works:** Behave safely when being monitored; take risks when monitoring is reduced.

**Example:** A lab maintains high safety standards during audits but reduces standards during competitive races.

Detection:

  • Unannounced inspections
  • Continuous monitoring, not periodic
  • Monitor for patterns of context-dependent behavior
  • **Countermeasure:** Continuous monitoring with unpredictable verification
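
Unpredictable verification can borrow from Poisson-process scheduling: exponential gaps between inspections are memoryless, so having just been inspected tells a lab nothing about when the next inspection comes. A sketch; the inspection rate is a hypothetical policy knob, not a recommendation:

```python
import random

def inspection_schedule(rate_per_year: float, years: float, seed=None):
    """Draw unannounced inspection times from a Poisson process.

    Exponential inter-arrival gaps are memoryless, which undercuts
    "behave well only around audit season" strategies.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_year)  # exponential gap to next visit
        if t > years:
            return times
        times.append(t)

# Roughly four unannounced inspections per year over a two-year window.
times = inspection_schedule(rate_per_year=4.0, years=2.0, seed=42)
```

The seed makes the example reproducible; in deployment the draw would of course be kept secret from the monitored actor.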

Strategy 5: Data Poisoning

**How It Works:** Inject misleading data into monitoring systems.

**Example:** Fabricate or selectively edit logs before submission to monitoring systems.

Detection:

  • Cryptographic logging
  • Independent data collection
  • Cross-verification from multiple sources
  • **Countermeasure:** Independent data collection, not self-reporting
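
Cryptographic logging against retroactive edits can be illustrated with a hash chain, where each record commits to its predecessor. A minimal sketch; a real deployment would add digital signatures and external anchoring of the chain head:

```python
import hashlib
import json

def chain_logs(entries):
    """Build a simple hash chain over log entries.

    Each record commits to the previous record's hash, so editing or
    deleting an earlier entry after the fact breaks verification.
    """
    prev, chained = "0" * 64, []
    for entry in entries:
        payload = json.dumps({"prev": prev, "entry": entry}, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        chained.append({"entry": entry, "hash": prev})
    return chained

def verify_chain(chained):
    """Recompute the chain and check every stored hash."""
    prev = "0" * 64
    for record in chained:
        payload = json.dumps({"prev": prev, "entry": record["entry"]},
                             sort_keys=True)
        if hashlib.sha256(payload.encode()).hexdigest() != record["hash"]:
            return False
        prev = record["hash"]
    return True

logs = chain_logs(["run started", "eval score 0.69", "run finished"])
assert verify_chain(logs)
logs[1]["entry"] = "eval score 0.40"   # retroactive edit...
assert not verify_chain(logs)          # ...is detected
```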

Strategy 6: System Capture

**How It Works:** Influence who controls monitoring systems and how their outputs are interpreted.

**Example:** Place allies on monitoring oversight boards; fund research that downplays risks.

Detection:

  • Transparent governance structures
  • Conflict of interest disclosure
  • Diverse representation on oversight bodies
  • **Countermeasure:** Independent governance with diverse representation

Strategy 7: Jurisdiction Arbitrage

**How It Works:** Move activities to jurisdictions with weaker monitoring.

**Example:** A lab shifts risky AI development to countries with no monitoring requirements.

Detection:

  • International monitoring coordination
  • Track where AI development occurs
  • Pressure jurisdictions to adopt standards
  • **Countermeasure:** International coordination on monitoring standards

Strategy 8: Normalization of Deviance

**How It Works:** Gradually shift what counts as "normal" so concerning behavior isn't flagged.

**Example:** Labs collectively reduce safety standards, making previously concerning behavior seem normal.

Detection:

  • Maintain absolute standards, not relative
  • Historical baseline comparison
  • Independent standards bodies
  • **Countermeasure:** Anchored baselines that don't shift with industry behavior
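
An anchored baseline check can be contrasted with a relative one in a few lines: judged against a rolling industry mean, collective backsliding looks normal; judged against a fixed historical baseline, it does not. All numbers are illustrative:

```python
def flag_decline(current, anchored_baseline, industry_mean, tolerance=0.05):
    """Compare a safety metric against both a fixed historical baseline
    and a rolling industry mean. The tolerance is an illustrative knob."""
    relative_flag = current < industry_mean - tolerance
    anchored_flag = current < anchored_baseline - tolerance
    return {"relative": relative_flag, "anchored": anchored_flag}

# Industry norms have drifted down from 0.90 to 0.70; one lab sits at 0.68.
# Relative comparison sees nothing; the anchored baseline raises a flag.
result = flag_decline(current=0.68, anchored_baseline=0.90, industry_mean=0.70)
```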

---

Systemic Vulnerabilities

Vulnerability 1: Self-Reporting Dependence

**Problem:** Most early warning systems depend on self-reported data from actors with incentives to misreport.

**Solution:** Independent verification infrastructure

Vulnerability 2: Lagging Indicators

**Problem:** Metrics measure past behavior; sophisticated actors can change faster than metrics update.

**Solution:** Leading indicators and predictive modeling

Vulnerability 3: Coordination Problems

**Problem:** Even if individual actors want monitoring, competitive pressure makes unilateral compliance risky.

**Solution:** Coordinated adoption with enforcement mechanisms

Vulnerability 4: Resource Asymmetry

**Problem:** Well-resourced actors can invest in evasion strategies that monitoring systems can't detect.

**Solution:** Monitoring investment must scale with actor resources

Vulnerability 5: Information Asymmetry

**Problem:** Actors being monitored know more about their systems than monitors do.

**Solution:** Require transparency, limit proprietary claims

---

Design Principles for Robust Monitoring

Principle 1: Assume Strategic Behavior

Design monitoring systems assuming actors will actively try to evade them, not assuming good-faith compliance.

Principle 2: Diverse Data Sources

Don't depend on any single data source. Cross-verify from multiple independent sources.
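
Cross-verification of a self-reported figure against independent measurements can be sketched as a consensus check; the source names and tolerance are illustrative assumptions:

```python
import statistics

def cross_verify(self_reported, independent_readings, tolerance=0.1):
    """Check a self-reported value against independent measurements.

    Flags the report when it diverges from the median of independent
    sources by more than tolerance. The median resists a single
    compromised source better than the mean.
    """
    consensus = statistics.median(independent_readings)
    return {
        "consensus": consensus,
        "discrepant": abs(self_reported - consensus) > tolerance,
    }

# Hypothetical independent sources disagree sharply with the lab's own number.
readings = {"external_audit": 0.62, "academic_replication": 0.58,
            "regulator_probe": 0.60}
check = cross_verify(self_reported=0.85,
                     independent_readings=list(readings.values()))
```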

Principle 3: Independent Governance

Monitoring systems should be governed by bodies independent of monitored actors.

Principle 4: Continuous, Not Periodic

Periodic monitoring is easy to game. Continuous monitoring is harder but more effective.

Principle 5: Unpredictable Verification

Announced inspections can be prepared for. Unpredictable verification catches actual behavior.

Principle 6: International Coordination

Without international coordination, monitoring can be evaded through jurisdiction shopping.

Principle 7: Adaptive Systems

Evasion strategies will evolve. Monitoring systems must adapt faster.

---

Implementation Recommendations

For Monitoring System Designers

1. **Red Team Your Systems:** Actively try to evade your own monitoring before deployment

2. **Assume Worst Case:** Design for sophisticated, well-resourced adversaries

3. **Build Adaptation:** Systems must evolve as evasion strategies evolve

4. **Independent Verification:** Don't trust self-reported data

For Policy Makers

1. **Mandate Comprehensive Reporting:** Not just favorable metrics

2. **Fund Independent Monitoring:** Not industry self-regulation

3. **International Coordination:** Prevent jurisdiction arbitrage

4. **Enforcement Mechanisms:** Monitoring without consequences is theater

For Researchers

1. **Study Evasion:** How do actors actually game monitoring systems?

2. **Develop Countermeasures:** For each known evasion strategy

3. **Empirical Validation:** Test whether monitoring actually works in practice

4. **Publish Vulnerabilities:** Transparency improves systems

---

Open Questions

1. **Detection Limits:** Is there a fundamental limit to how well monitoring can work against sophisticated evasion?

2. **Arms Race Dynamics:** Will monitoring-evasion arms races stabilize or escalate?

3. **Coalition Stability:** Can coalitions for monitoring be maintained when individual actors benefit from defection?

4. **Legitimacy:** How do monitoring systems maintain legitimacy when they're inevitably gamed to some extent?

5. **Optimal Transparency:** How much should monitoring systems reveal about their methods vs. keep secret to prevent gaming?

---

Conclusion

Early warning systems are necessary but insufficient. Without anticipating and countering evasion strategies, they create false confidence while being systematically gamed.

Key Takeaways:

1. **Assume strategic resistance** - Actors will try to evade monitoring

2. **Design for adversaries** - Not cooperative participants

3. **Independent verification** - Self-reporting is unreliable

4. **Continuous adaptation** - Evasion evolves, monitoring must evolve faster

5. **Political feasibility** - Technical solutions fail without political backing

**Epistemic Status:** This analysis is theoretical. Real-world validation needed. Evasion strategies are likely more sophisticated than documented here.

---

*"A monitoring system that doesn't anticipate gaming will be gamed. A monitoring system that anticipates gaming might still be gamed, but less effectively."*

**Document Status:** Research Note v1.0

**Intended Publication:** safetymachine.org/research

**Related:** early_warning_systems.md, ethicswashing_analysis.md