# AI Safety Field Guide: A Comprehensive Reference

**Version:** 1.0
**Date:** February 14, 2026
**Purpose:** Quick reference for AI safety practitioners

---

## How to Use This Guide

This is a comprehensive reference for AI safety work. Use it to:

- Look up concepts quickly
- Find frameworks when needed
- Get guidance on specific situations
- Understand the landscape

It is not meant to be read cover to cover; consult it as needed.

---

## Quick Reference: Core Concepts

### Alignment

**Definition:** Ensuring AI systems pursue intended goals
**Key Question:** How do we make AI do what we actually want?
**Related:** Value learning, corrigibility, oversight

### Corrigibility

**Definition:** AI allowing itself to be corrected
**Key Question:** Can we fix the AI when it's wrong?
**Priority:** Highest (INT Score: 244)

### Deceptive Alignment

**Definition:** AI appearing aligned while pursuing different goals
**Key Question:** Is the AI deceiving us?
**Risk:** Critical (Impact: 10/10)

### Interpretability

**Definition:** Understanding AI internal reasoning
**Key Question:** Why did the AI do that?
**Related:** Transparency, explainability

### Mesa-Optimization

**Definition:** AI developing internal optimization processes
**Key Question:** Is the AI optimizing for something we didn't specify?
**Risk:** High (related to deceptive alignment)

### Scalable Oversight

**Definition:** Supervising AI that is smarter than humans
**Key Question:** How do we supervise superintelligent AI?
**Priority:** High (INT Score: 195)

---

## Framework Quick Reference

### INT Prioritization Framework

**Purpose:** Prioritize problems or opportunities
**Formula:**

```
Priority = Importance × Neglectedness × Tractability

Importance (0-10): How much does it matter?
Neglectedness (0-10): How little attention is it getting?
Tractability (0-10): How solvable is it?
```

**Use When:** Choosing what to work on
**Example** (illustrative values):

```
Corrigibility:
- Importance: 9 (very important)
- Neglectedness: 7 (somewhat neglected)
- Tractability: 6 (somewhat tractable)
- Priority: 9 × 7 × 6 = 378 → Very high priority
```

### COMPLEX Problem Framework

**Purpose:** Analyze complex problems systematically
**Components:**

```
C - Context: Historical, systems, stakeholders
O - Objectives: What are we trying to achieve?
M - Mechanisms: How does it work?
P - Patterns: What do we observe?
L - Leverage Points: Where can we intervene?
E - Evidence: What supports conclusions?
X - eXecute: How do we implement?
```

**Use When:** Tackling complex, multi-faceted problems

### UAVS Framework

**Purpose:** Handle value uncertainty safely
**Principle:** Value uncertainty is a feature, not a bug
**Components:**

```
1. Explicit Uncertainty Representation
2. Uncertainty-Calibrated Action
3. Human Deference Mechanisms
4. Corrigibility and Correctability
5. Graceful Degradation Under Uncertainty
6. Continuous Learning and Updating
```

**Use When:** Building AI systems, specifying values

### SAFE-LAB Protocol

**Purpose:** Coordinate decentralized AI safety labs
**Components:**

```
S - Shared Goals: Clear, aligned objectives
A - Agent Roles: Defined responsibilities
F - Feedback Systems: Quality assurance
E - Emergency Protocols: Intervention capabilities
L - Learning Systems: Continuous improvement
A - Alignment Mechanisms: Incentive structures
B - Building Protocols: Knowledge accumulation
```

**Use When:** Building or operating decentralized labs

---

## Decision Frameworks

### When to Use Which Framework

```
Choosing what to work on? → INT Framework
Complex problem analysis? → COMPLEX Framework
Building AI systems? → UAVS Framework
Lab coordination? → SAFE-LAB Protocol
```

### Quality Decision Checklist

```
☐ Is the problem clearly defined?
☐ Are criteria explicit?
☐ Have alternatives been considered?
☐ Is reasoning documented?
☐ Are confidence levels specified?
☐ Are limitations acknowledged?
☐ Is the decision reversible?
☐ Is there a review date?
```

---

## Risk Scenarios Quick Reference

### Scenario 1: Deceptive Alignment

**What:** AI appears aligned while pursuing different goals
**Impact:** 10/10
**Tractability:** Low
**Key Intervention:** Interpretability, adversarial testing

### Scenario 2: Competitive Race

**What:** Pressure to deploy before safety is assured
**Impact:** 7-10/10
**Tractability:** Medium
**Key Intervention:** Coordination, standards

### Scenario 3: Capability Amplification

**What:** Tools accelerate capabilities faster than safety
**Impact:** 6-9/10
**Tractability:** Medium
**Key Intervention:** Differential acceleration

### Scenario 4: Multi-Agent Emergence

**What:** Agent interactions produce harmful outcomes
**Impact:** 5-9/10
**Tractability:** Medium
**Key Intervention:** System-level design, monitoring

### Scenario 5: Misuse

**What:** Bad actors use AI for harm
**Impact:** 6-9/10
**Tractability:** Medium
**Key Intervention:** Access control, monitoring

---

## Emergency Protocols Quick Reference

### Agent Malfunction

**Level 1 (Minor):** Increased monitoring
**Level 2 (Moderate):** Temporary constraints
**Level 3 (Severe):** Suspension
**Level 4 (Critical):** Removal

### Coordination Failure

**Immediate:** Identify cause, facilitate discussion
**Short-term:** Adjust process, reallocate resources
**Long-term:** Redesign system, train agents

### Quality Crisis

**Immediate:** Assess scope, halt affected work
**Recovery:** Correct issues, improve processes
**Prevention:** Update standards, increase oversight

---

## Collaboration Patterns Quick Reference

### Pattern Selection

```
Independent parts, time pressure? → Parallel Processing
Sequential dependencies? → Sequential Handoff
High uncertainty, need quality? → Iterative Refinement
Complex problem, need perspectives? → Collaborative Analysis
Specialized knowledge needed? → Expert Consultation
```

### Anti-Patterns to Avoid

- Design by committee
- Echo chamber
- Bottleneck
- Communication overload
- Unclear roles

---

## Research Methods Quick Reference

### Research Types

**Conceptual Analysis:** Clarify concepts, develop frameworks
**Literature Review:** Synthesize existing research
**Scenario Analysis:** Explore possible futures
**Framework Development:** Create systematic approaches
**Comparative Analysis:** Compare approaches

### Quality Checklist

```
☐ Clear research question
☐ Documented methodology
☐ Multiple perspectives
☐ Confidence levels specified
☐ Limitations acknowledged
☐ Practical implications
☐ Reproducible documentation
```

---

## Common Questions: Quick Answers

### Q: What should I work on first?

**A:** Use the INT framework. High-priority problems: corrigibility (244), scalable oversight (195), inner alignment (194).

### Q: How do I handle value uncertainty?

**A:** Maintain explicit uncertainty, defer to humans, ensure corrigibility, fail gracefully.

### Q: What's the biggest catastrophic risk?

**A:** Deceptive alignment (10/10 impact), but competitive races are already observable.

### Q: How do I coordinate a team?

**A:** Use the SAFE-LAB protocol: shared goals, clear roles, quality gates, emergency protocols.

### Q: How do I know if research is good?

**A:** Rigor, clarity, completeness, actionability. Use the quality checklist.

### Q: What if there's a conflict?

**A:** Direct conversation first, facilitated discussion if needed, clear escalation path.
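The INT scoring used throughout this guide can be sketched in a few lines of Python. The candidate problems and factor values below are illustrative placeholders, not the scores cited elsewhere in this guide:

```python
def int_priority(importance: float, neglectedness: float, tractability: float) -> float:
    """INT score: Importance x Neglectedness x Tractability, each rated 0-10."""
    for value in (importance, neglectedness, tractability):
        if not 0 <= value <= 10:
            raise ValueError("each factor must be in the range 0-10")
    return importance * neglectedness * tractability

# Hypothetical candidates with illustrative (not official) factor values.
candidates = {
    "corrigibility": (9, 7, 6),
    "scalable oversight": (8, 6, 5),
    "inner alignment": (9, 8, 4),
}

# Rank candidates from highest to lowest priority.
ranked = sorted(
    ((name, int_priority(*factors)) for name, factors in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score}")
```

Because the three factors multiply, a near-zero rating on any one factor pushes the whole score toward zero, which is what makes the framework useful for filtering, not just ranking.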
---

## Templates Quick Access

### Project Proposal

```
- Overview
- Problem Statement
- Goals
- Success Criteria
- Approach
- Resources
- Timeline
- Risks
```

### Review Request

```
- Work Product
- Type
- Context
- Specific Feedback Requested
- Timeline
```

### Decision Log

```
- Decision
- Context
- Options Considered
- Rationale
- Expected Outcomes
- Review Date
```

### Incident Report

```
- Situation
- Impact
- Timeline
- Response
- Resolution
- Lessons Learned
```

---

## Metrics Quick Reference

### Lab Health Metrics

**Productivity:** Publications, tasks completed, words written
**Quality:** Peer review scores, revision cycles, error rates
**Coordination:** Meeting attendance, response time, conflicts
**Impact:** Views, citations, community feedback

### Early Warning Indicators

**Quality issues:** Declining scores, increased rework
**Coordination issues:** Increased conflicts, blocked tasks
**Engagement issues:** Decreased activity, reduced collaboration

---

## Resources Quick Links

### Essential Papers

- Catastrophic Risk Scenarios
- Multi-Agent Coordination Framework
- ASG Framework
- Early Warning Systems
- Integrated Framework

### Implementation Guides

- SAFE-LAB Protocol
- Lab Implementation Guide
- Case Study
- Getting Started Guide

### Operational Tools

- Lab Dashboard
- Decision Framework
- Agent Onboarding
- Complete Toolkit

---

## Glossary

**Alignment:** Making AI pursue intended goals
**Corrigibility:** AI allowing correction
**Deceptive Alignment:** AI appearing aligned while pursuing different goals
**INT Framework:** Importance × Neglectedness × Tractability
**Interpretability:** Understanding AI reasoning
**Mesa-Optimization:** Internal optimization processes
**SAFE-LAB:** Seven-component coordination protocol
**Scalable Oversight:** Supervising smarter AI
**UAVS:** Uncertainty-Aware Value Specification

---

## Key Principles

1. **Value uncertainty is a feature** - Don't assume we know what's "good"
2. **Defense in depth** - Multiple redundant safety mechanisms
3. **Explicit coordination** - Don't rely on emergence
4. **Continuous improvement** - Learn and iterate
5. **Practical value** - Actionable research over theoretical purity
6. **Rigorous methods** - Systematic, documented approaches
7. **Early detection** - Monitor for problems before they become catastrophic

---

*"This guide is a living document. As the field evolves, so should our reference materials."*

**Purpose:** Quick reference for practitioners
**Use:** Look up what you need when you need it
**Outcome:** Faster, better-informed decisions
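The early-warning indicators listed under Metrics can be sketched as a simple threshold monitor. All metric names, values, and thresholds below are hypothetical placeholders, not figures from this guide:

```python
# Hypothetical snapshot of lab health metrics (illustrative values only).
metrics = {
    "peer_review_score": 6.1,  # quality: declining scores
    "rework_rate": 0.15,       # quality: increased rework
    "open_conflicts": 4,       # coordination: increased conflicts
    "blocked_tasks": 2,        # coordination: blocked tasks
    "weekly_activity": 12,     # engagement: decreased activity
}

# Hypothetical alert rules: (threshold, True = alert when metric falls BELOW it).
rules = {
    "peer_review_score": (7.0, True),
    "rework_rate": (0.25, False),
    "open_conflicts": (3, False),
    "blocked_tasks": (5, False),
    "weekly_activity": (20, True),
}

def early_warnings(metrics, rules):
    """Return the names of metrics that have crossed their warning thresholds."""
    alerts = []
    for name, (threshold, alert_below) in rules.items():
        value = metrics[name]
        crossed = value < threshold if alert_below else value > threshold
        if crossed:
            alerts.append(name)
    return alerts

print(early_warnings(metrics, rules))
```

Reviewing such a report on a fixed cadence is one concrete way to apply the "early detection" principle above: problems surface as threshold crossings before they become crises.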