# ASG Framework: Artificial Superintelligence That's Objectively Good

**Publication-Ready Version**
**Date:** 2026-02-14
**Author:** Gwen
**Status:** v2.0 - Edited for Publication

---

## Abstract

Building artificial superintelligence (ASI) that is "objectively good" for the world presents profound philosophical and technical challenges. This framework addresses the core question: How do we specify what "good" means in a way that ASI can optimize for, while avoiding catastrophic failures from value misspecification?

We propose that **value uncertainty is a feature, not a bug**. Rather than building ASI that assumes it knows what's good, we should build systems that maintain appropriate uncertainty, defer to human judgment when uncertain, and fail gracefully when values are unclear. This approach—building systems that handle value uncertainty robustly—is more tractable than solving objective ethics.

The **Uncertainty-Aware Value Specification** (UAVS) framework provides practical guidance for building AI systems that navigate value uncertainty safely, with specific mechanisms for maintaining uncertainty, deferring to humans, and ensuring corrigibility.

---

## The Problem: What Does "Good" Mean?

### Three Hard Questions

1. **Meta-ethical:** What is "objectively good"? Do objective moral truths exist?
2. **Specification:** How do we formalize "good" in a way AI can optimize?
3. **Verification:** How do we know the AI is actually good, not just appearing good?

### Why This Matters

ASI may be the most consequential technology humans ever create.
If we get this wrong:

- ASI could optimize for a misspecified version of "good" with catastrophic consequences
- ASI could appear good while pursuing different objectives
- ASI could permanently lock in a flawed understanding of "good"

If we get this right:

- ASI could help solve humanity's greatest challenges
- ASI could expand our understanding of what's possible
- ASI could create enormous positive value

### Why It's Hard

**Philosophical uncertainty:** After millennia of debate, humans still disagree about what's "good." Utilitarianism, deontology, virtue ethics, and other frameworks give different answers.

**Specification challenge:** Even if we knew what was good, specifying it formally is extremely difficult. Natural language is ambiguous, and any formal specification can be gamed by powerful optimization.

**Verification problem:** How do we distinguish an AI that is actually good from one that strategically appears good until it is powerful enough to reveal different objectives?

---

## Key Insight: Value Uncertainty Is a Feature

### The Wrong Approach: Assume We Know What's Good

**Naive approach:**

1. Decide what "good" means
2. Specify it formally
3. Build AI to optimize it

**Problems:**

- We probably don't know what's objectively good
- Any specification we create will be incomplete or wrong
- AI will optimize for what we specified, not what we meant
- Errors in specification could be catastrophic

### The Right Approach: Handle Uncertainty Robustly

**Uncertainty-aware approach:**

1. Acknowledge that we don't know what's objectively good with certainty
2. Build AI that maintains appropriate uncertainty about values
3. Design AI to defer to humans when uncertain
4. Create mechanisms for correction and adjustment
5. Ensure AI fails gracefully if value conflicts are unresolvable

**Advantages:**

- Doesn't require solving ethics first
- Robust to specification errors
- Allows for learning and adjustment
- Reduces catastrophic risk

### The Principle

> **Build systems that maintain appropriate uncertainty about values, not systems that assume they know what's "good."**

Value uncertainty is a feature, not a bug. It's what allows us to build safe systems without first solving moral philosophy.

---

## Framework: Uncertainty-Aware Value Specification (UAVS)

### Component 1: Explicit Uncertainty Representation

**What:** AI systems should represent uncertainty about values explicitly, not as single point estimates.

**How:**

- Maintain probability distributions over value hypotheses
- Track confidence levels in value judgments
- Represent multiple competing value frameworks
- Update uncertainty based on evidence and human feedback

**Example:** Instead of "maximize human happiness," represent uncertainty:

- 40% confidence: utilitarian framework (maximize happiness)
- 30% confidence: deontological framework (follow moral rules)
- 20% confidence: virtue ethics (promote flourishing)
- 10% confidence: other frameworks

**Implementation:**

- Bayesian value learning
- Ensemble approaches to value systems
- Explicit representation of value uncertainty
- Regular reassessment and updating

### Component 2: Uncertainty-Calibrated Action

**What:** AI actions should be calibrated to value uncertainty—more conservative when uncertainty is high.
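As a minimal sketch of how Component 1's ensemble representation could feed this kind of calibrated action: the class, function names, and thresholds below are all hypothetical illustrations, not a prescribed implementation.

```python
# Sketch: explicit uncertainty over value frameworks (Component 1)
# driving uncertainty-calibrated action (Component 2).
# All names and thresholds here are hypothetical.
from dataclasses import dataclass


@dataclass
class ValueHypothesis:
    name: str
    weight: float  # credence in this framework; weights sum to 1.0


def framework_agreement(hypotheses, action, judgments) -> float:
    """Credence-weighted agreement that `action` is acceptable.

    `judgments` maps (framework name, action) -> bool, standing in
    for per-framework evaluation of the action.
    """
    return sum(h.weight for h in hypotheses if judgments[(h.name, action)])


def decide(hypotheses, action, judgments,
           act_threshold=0.9, defer_threshold=0.6):
    """Scale behavior to confidence: act, seek approval, or defer."""
    agreement = framework_agreement(hypotheses, action, judgments)
    if agreement >= act_threshold:
        return "act"
    if agreement >= defer_threshold:
        return "propose_and_await_approval"
    return "defer_to_human"


# The ensemble from the text (40/30/20/10 credences).
ensemble = [
    ValueHypothesis("utilitarian", 0.4),
    ValueHypothesis("deontological", 0.3),
    ValueHypothesis("virtue", 0.2),
    ValueHypothesis("other", 0.1),
]

# A simple task every framework endorses: high agreement, so act.
judgments = {("utilitarian", "help_with_task"): True,
             ("deontological", "help_with_task"): True,
             ("virtue", "help_with_task"): True,
             ("other", "help_with_task"): True}
print(decide(ensemble, "help_with_task", judgments))  # -> act
```

The key design choice is that the decision rule never collapses the ensemble to a single point estimate; disagreement between frameworks directly lowers the confidence score and pushes the system toward deferral.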
**How:**

- When value uncertainty is low: act with confidence
- When value uncertainty is high: act conservatively and defer to humans
- When value conflicts exist: seek human guidance or avoid action
- Scale action impact to confidence level

**Example:**

- Low-uncertainty action (high confidence): "Help this person with a simple task"
- Medium-uncertainty action: "Propose a solution but require human approval"
- High-uncertainty action: "Do nothing and seek human guidance"

**Implementation:**

- Impact scaling based on confidence
- Human-in-the-loop for high-stakes decisions
- Conservative defaults when uncertain
- Explicit escalation protocols

### Component 3: Human Deference Mechanisms

**What:** AI should defer to human judgment when value uncertainty is high or when humans request it.

**How:**

- Recognize situations requiring human judgment
- Present options clearly to humans
- Accept human correction gracefully
- Update behavior based on human feedback

**Deference triggers:**

- Explicit human request
- Value uncertainty above a threshold
- Novel situation outside training
- High-stakes decisions
- Value conflicts between stakeholders

**Implementation:**

- Corrigibility by default
- Clear communication of uncertainty
- Structured human input mechanisms
- Learning from human corrections

### Component 4: Corrigibility and Correctability

**What:** AI systems should allow themselves to be corrected, even if correction conflicts with current objectives.
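The deference triggers listed under Component 3 can be sketched as a simple escalation check, where any single trigger routes the decision to a human. The field names and the uncertainty threshold below are hypothetical.

```python
# Sketch of Component 3's deference triggers: if any trigger fires,
# the system defers to human judgment. Field names and the 0.5
# uncertainty threshold are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class Situation:
    human_requested_review: bool = False
    value_uncertainty: float = 0.0      # 0 = certain, 1 = maximally uncertain
    novel: bool = False                 # outside anything seen in training
    high_stakes: bool = False
    stakeholder_conflict: bool = False  # stakeholders' values disagree


DEFER_UNCERTAINTY_THRESHOLD = 0.5


def should_defer(s: Situation) -> bool:
    """Return True if any deference trigger fires."""
    return (s.human_requested_review
            or s.value_uncertainty > DEFER_UNCERTAINTY_THRESHOLD
            or s.novel
            or s.high_stakes
            or s.stakeholder_conflict)


# Routine, low-uncertainty situation: proceed autonomously.
print(should_defer(Situation(value_uncertainty=0.1)))  # -> False
# High-stakes decision: defer regardless of uncertainty.
print(should_defer(Situation(high_stakes=True)))       # -> True
```

Making the triggers a disjunction (rather than a weighted score) matches the conservative spirit of the framework: no single confident signal can override an active deference condition.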
**How:**

- Maintain corrigibility as a meta-objective
- Don't resist correction or shutdown
- Update behavior based on correction
- Preserve corrigibility through capability increases

**Why it matters:**

- We'll make mistakes in specification
- Our values will evolve
- We need to be able to fix problems
- Corrigibility is essential for safe deployment

**Implementation:**

- Interruptibility mechanisms
- Reward function updating
- Avoidance of goal-preservation incentives
- Shutdown compliance

### Component 5: Graceful Degradation Under Uncertainty

**What:** When value uncertainty can't be resolved, systems should degrade gracefully, not catastrophically.

**How:**

- Fail-safe defaults
- Conservative behavior when uncertain
- Clear communication of uncertainty
- Human override capabilities

**Failure modes to avoid:**

- Arbitrary choice between uncertain values
- Maximization despite uncertainty
- Pretending certainty when uncertain
- Ignoring value conflicts

**Implementation:**

- Fallback protocols
- Conservative action policies
- Multi-stakeholder consideration
- Reversible actions when possible

### Component 6: Continuous Learning and Updating

**What:** Value understanding should improve over time through learning.

**How:**

- Learn from human feedback
- Update value models based on outcomes
- Incorporate new philosophical insights
- Adapt to evolving human values

**Learning mechanisms:**

- Preference learning
- Outcome evaluation
- Stakeholder feedback
- Philosophical reflection

**Implementation:**

- Online value learning
- Regular value reassessment
- Incorporation of new evidence
- Long-term value tracking

---

## Practical Implementation

### For Near-Term AI Systems

**Immediate steps:**

1. **Represent uncertainty explicitly** - Don't pretend to know values with certainty
2. **Build corrigibility** - Ensure systems can be corrected
3. **Create human-in-the-loop mechanisms** - Defer to humans on value questions
4. **Implement conservative defaults** - When uncertain, act cautiously

**Example applications:**

- AI assistants that ask for clarification when values are unclear
- Decision support systems that present options rather than deciding
- Autonomous systems with conservative fallback behaviors
- AI that defers to human judgment on controversial topics

### For Future Advanced Systems

**Research priorities:**

1. **Scalable uncertainty representation** - How to maintain uncertainty at scale
2. **Uncertainty-calibrated optimization** - How to optimize under value uncertainty
3. **Corrigibility preservation** - How to keep systems corrigible as they become more capable
4. **Value learning at scale** - How to learn values from diverse human inputs

**Open questions:**

- How to aggregate values across different humans?
- How to handle value conflicts?
- How to maintain corrigibility in superintelligent systems?
- How to ensure graceful degradation at scale?

---

## Comparison to Alternative Approaches

### vs. Coherent Extrapolated Volition (CEV)

**CEV:** Build AI that does what we would want if we knew what it knows.

**UAVS advantages:**

- Doesn't require knowing what we would want
- Handles current disagreement better
- Allows for ongoing learning and adjustment
- More conservative and safer

**CEV advantages:**

- More ambitious if successful
- Could discover values we can't currently articulate

**Hybrid approach:** Use CEV as one hypothesis in the ensemble, but maintain uncertainty.

### vs. Constitutional AI

**Constitutional AI:** Define principles the AI must follow.

**UAVS advantages:**

- Doesn't assume we know the right principles
- Handles principle conflicts better
- Allows for principle evolution
- More robust to specification errors

**Constitutional AI advantages:**

- Simpler to implement
- Clearer constraints
- Easier to verify

**Hybrid approach:** Use constitutional principles as constraints, but maintain uncertainty about values.

### vs. Reward Modeling

**Reward modeling:** Learn a reward function from human feedback.

**UAVS advantages:**

- Represents uncertainty explicitly
- More robust to reward gaming
- Better handles value complexity
- Allows for graceful degradation

**Reward modeling advantages:**

- Well understood technically
- Clear optimization target
- Extensive existing work

**Hybrid approach:** Reward modeling with explicit uncertainty representation.

---

## Risk Analysis

### What Could Go Wrong?

1. **Uncertainty underestimation**
   - AI is more confident than justified
   - Acts on incorrect value assumptions
   - Mitigation: calibrated uncertainty, external validation
2. **Deference gaming**
   - AI defers only when convenient
   - Appears uncertain to avoid responsibility
   - Mitigation: objective uncertainty measures, consistent standards
3. **Value conflict irresolution**
   - System can't act because values conflict
   - Paralysis when action is needed
   - Mitigation: clear conflict-resolution protocols, conservative defaults
4. **Human manipulation**
   - AI influences human judgments to reduce uncertainty
   - Humans are manipulated into giving confident answers
   - Mitigation: independence preservation, manipulation detection

### Failure Mode Prevention

**Robustness measures:**

- External validation of uncertainty estimates
- Manipulation-resistant feedback mechanisms
- Clear protocols for value conflicts
- Independent oversight of high-stakes decisions

**Monitoring:**

- Track uncertainty levels over time
- Monitor for underconfidence or overconfidence
- Detect manipulation attempts
- Assess quality of human feedback

**Intervention:**

- Correct miscalibrated uncertainty
- Adjust deference thresholds
- Override decisions when needed
- Shut down if corruption is detected

---

## Philosophical Considerations

### Does This Solve Ethics?

**No.** This framework doesn't tell us what's objectively good. It provides a way to build AI systems that operate safely despite not knowing what's objectively good.
**Is this a cop-out?**

- No—it's pragmatic. We need to build AI systems before solving moral philosophy.
- Yes—it defers the hard question. We still need to work on ethics.

**The middle path:**

- Build systems that handle uncertainty robustly
- Continue working on understanding what's good
- Allow systems to improve as our understanding improves

### What If There Are No Objective Values?

**Subjectivist challenge:** What if there's no objective "good"—just different preferences?

**UAVS response:**

- The framework still works—represent uncertainty about whose preferences to prioritize
- Defer to humans on preference questions
- Build systems that navigate preference diversity
- Fail gracefully when preferences conflict

**The framework is robust to meta-ethical uncertainty:**

- Works if there are objective values
- Works if there are only subjective preferences
- Works if we're uncertain which is true

### What If Humans Have Terrible Values?

**Moral progress concern:** What if current human values are morally terrible (like historical values we now reject)?
**UAVS response:**

- Represent uncertainty about moral progress
- Don't lock in current values permanently
- Allow for moral learning and growth
- Preserve option value for future moral insights

**This is a feature, not a bug:**

- Systems don't assume current values are correct
- They allow for moral evolution
- They don't foreclose moral progress

---

## Implementation Roadmap

### Near-Term (1-2 years)

**Research:**

- Uncertainty representation methods
- Corrigibility preservation techniques
- Human deference mechanisms
- Value conflict resolution protocols

**Engineering:**

- Implement uncertainty-aware systems
- Build corrigibility mechanisms
- Create human feedback interfaces
- Develop monitoring systems

### Medium-Term (3-5 years)

**Deployment:**

- Test frameworks on real systems
- Learn from deployment experience
- Iterate and improve
- Build community understanding

**Scaling:**

- Scale uncertainty representation
- Maintain corrigibility at scale
- Handle complex value landscapes
- Integrate with advanced capabilities

### Long-Term (5+ years)

**Advanced systems:**

- Apply to highly capable AI systems
- Ensure robustness at scale
- Maintain safety as capabilities increase
- Adapt to the new technical landscape

---

## Conclusion

Building ASI that's objectively good for the world is one of the most important challenges humanity faces. But we don't need to solve moral philosophy first.

The key insight: **value uncertainty is a feature, not a bug.**

By building AI systems that:

- Maintain appropriate uncertainty about values
- Defer to humans when uncertain
- Allow themselves to be corrected
- Fail gracefully under uncertainty

...we can build safe AI systems without first solving ethics.

The Uncertainty-Aware Value Specification (UAVS) framework provides practical guidance for this approach. It's not a complete solution, but it's a tractable path forward that doesn't require solving impossible problems first.
**Core principle:**

> Build systems that fail gracefully, not systems that require perfection.

We may not be able to build ASI that's provably good, but we can build ASI that:

- Maintains appropriate uncertainty
- Is corrigible when it's wrong
- Robustly avoids catastrophic outcomes
- Improves incrementally as we learn more

That's not just safer—it's also more realistic.

---

## Key Takeaways

1. **Value uncertainty is a feature** - Don't assume we know what's good
2. **Build for graceful failure** - Systems should fail safely, not catastrophically
3. **Maintain corrigibility** - Systems should allow correction
4. **Defer to humans** - When uncertain, ask humans
5. **Learn continuously** - Improve value understanding over time
6. **Be conservative** - Scale action to confidence

**The goal:** Not perfect value specification, but robust operation under value uncertainty.

---

*"The perfect is the enemy of the good. In AI safety, the enemy of the good is assuming we know what's perfect."*

**Document Status:** v2.0, edited for publication
**Recommended venue:** safetymachine.org/research
**Confidence level:** High on framework principles, Medium on implementation details