Trust is fundamental to AI safety governance, yet poorly understood. We trust AI developers to build safe systems, regulators to enforce rules, and other nations to honor agreements. But philosophical analysis reveals that trust is more than reliance: it involves vulnerability to betrayal, carries normative expectations, and often requires particular motivations in the trusted party. Understanding trust's complexity is essential for designing governance mechanisms that can function when trust is warranted and survive when it is not.
The Trust Problem in AI Safety
AI safety governance depends on trust at multiple levels:
- Public trust: Citizens trusting that AI systems are safe and beneficial
- Regulatory trust: Governments trusting companies to comply with rules
- Self-regulatory trust: Industry trusting voluntary frameworks
- International trust: Nations trusting each other's commitments
- Technical trust: Trusting that safety evaluations detect problems
Yet trust in AI actors is often misplaced, withdrawn too quickly, or demanded inappropriately. Tech companies ask for trust while concealing their methods. Regulators ask to be trusted while lacking enforcement capacity. Nations pledge cooperation while quietly racing. The governance challenge is not just building trust, but understanding when trust is warranted and when other mechanisms should replace it.
What Is Trust? Philosophical Foundations
Philosophical analysis reveals trust to be more complex than commonly assumed. Several key distinctions matter for AI safety governance.
Trust vs. Reliance
Trust is not merely reliance. As Annette Baier notes, trusting can be "betrayed, or at least let down, and not just disappointed." When my alarm clock fails, I am disappointed but not betrayed—alarm clocks cannot betray. But when a colleague fails to deliver promised work, I may feel betrayed.
This distinction matters because:
- Trust creates moral obligations: We expect those we trust to recognize their responsibility
- Trust enables monitoring reduction: We can suspend some oversight when we truly trust
- Trust is a relationship: It exists between persons (or person-like entities), not merely with objects
For AI governance, this suggests that "trust" in corporations or institutions is meaningful only if betrayal is possible—and that systems designed to prevent all betrayal may actually prevent trust.
The Competence and Willingness Conditions
Trustworthiness requires both competence and willingness. I cannot sensibly trust an incompetent surgeon, no matter their goodwill; nor a competent surgeon who lacks the motivation to help me.
For AI safety:
- Competence: Can this actor actually build safe AI? Do regulators have the technical expertise to evaluate systems?
- Willingness: Does this actor want to build safe AI? Will companies prioritize safety over speed?
Many governance debates conflate these. Companies emphasize their willingness ("we care about safety") while downplaying competence questions ("can we actually guarantee safety?"). Critics emphasize unwillingness ("they'll cut corners for profit") while sometimes ignoring genuine uncertainty about what safe AI requires.
The Motive Question
Philosophers debate whether trustworthy action must spring from particular motives:
Encapsulated interests (Hardin): People are trustworthy when they have self-interested reasons to act as trusted—when their interests "encapsulate" the trustor's interests. A company is trustworthy if maintaining the relationship matters more than betrayal.
Goodwill (Baier): People are trustworthy when they act from genuine care for the trustor or what they're entrusted with. Motive matters—a company treating users well only to extract more data is not truly trustworthy.
Moral integrity: People are trustworthy when committed to moral values regardless of relationship. A stranger is trustworthy if committed to decency.
Commitment (Hawley): People are trustworthy when they have a commitment to doing what they're trusted to do, regardless of motive. What matters is the commitment, not why it exists.
For AI safety, these theories have different implications:
- Encapsulated interests: Focus on making safety align with corporate self-interest (liability, reputation, regulatory capture prevention)
- Goodwill: Demand evidence of genuine concern for human welfare, not just profit motives
- Moral integrity: Look for actors with genuine commitments to safety values
- Commitment: Establish clear, public commitments that create trustworthiness regardless of underlying motives
Different governance mechanisms may be needed depending on which conception we adopt. If goodwill is essential, we need ways to assess motives. If commitment suffices, we need mechanisms for creating and verifying commitments.
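To see what the encapsulated-interests view asks of governance, consider a minimal repeated-game sketch: a self-interested actor honors trust only while the discounted value of the ongoing relationship exceeds the one-shot gain from betrayal. The numbers below are illustrative assumptions, not estimates from the trust literature.

```python
def betrayal_is_profitable(one_shot_gain: float,
                           per_round_value: float,
                           discount: float) -> bool:
    """In an indefinitely repeated interaction where betrayal ends the
    relationship, betraying pays only if the immediate gain exceeds the
    discounted stream of future rounds: gain > d * v / (1 - d)."""
    future_value = discount * per_round_value / (1.0 - discount)
    return one_shot_gain > future_value

# Illustrative assumption: a lab gains 10 (arbitrary units) by cutting
# a safety corner once, versus 1 per round from the ongoing relationship.
for discount in (0.5, 0.95):
    print(discount, betrayal_is_profitable(10.0, 1.0, discount))
# At discount 0.5 the future is worth only 1.0, so betrayal pays;
# at 0.95 it is worth 19.0, so honoring trust pays.
```

On this view, the real work of governance is raising the value of the continuing relationship (licensing, procurement, reputation) and the actor's patience, not cultivating goodwill.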
Trust in Institutions vs. Persons
Can we trust corporations, governments, or international bodies in the same way we trust individuals? Philosophical analysis suggests caution.
Institutions lack feelings, cannot have goodwill in the personal sense, and their "motives" are aggregations of individual interests. When we "trust" a corporation, we might mean:
- We trust their institutional design to produce certain behaviors
- We trust key individuals within the institution
- We rely on them without full-blown trust
This matters for AI safety governance because:
- Corporate trust is fragile: Leadership changes, incentives shift, "trust" evaporates
- Institutional design > personal trust: Better to design systems that work despite untrustworthy actors
- Mixed trust landscapes: We may trust some individuals within untrustworthy institutions
When Is Trust Warranted? The Epistemology of Trust
The epistemology of trust asks: when is trust justified? Several factors matter:
1. Evidence of Trustworthiness
- Track record: Has this actor earned trust through past behavior?
- Transparency: Can we observe their methods and motives?
- Accountability: Are there consequences for betrayal?
For AI companies, transparency reports, safety incident disclosures, and independent audits provide such evidence. Companies demanding "trust" without providing evidence are asking for something other than warranted trust.
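One way to make the track-record criterion precise is to treat each observed commitment as evidence and update a belief about trustworthiness. The beta-Bernoulli sketch below is a deliberately crude illustration, assuming commitments are observable and comparable; the prior and the counts are invented for the example.

```python
def trust_posterior(prior_a: float, prior_b: float,
                    kept: int, broken: int) -> float:
    """Beta-Bernoulli update: posterior mean probability that the actor
    honors a commitment, after observing `kept` honored and `broken`
    violated commitments."""
    a = prior_a + kept
    b = prior_b + broken
    return a / (a + b)

# Skeptical prior Beta(1, 3): absent any track record, expect ~25%.
print(trust_posterior(1, 3, kept=0, broken=0))   # 0.25
print(trust_posterior(1, 3, kept=18, broken=2))  # ~0.79
```

The epistemic point survives the model's crudeness: evidence can move the estimate only if behavior is observable, which is why transparency and independent audits matter.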
2. The Cost of Betrayal vs. Verification
Trust is warranted when the cost of verification exceeds the expected cost of betrayal. If I can easily verify your behavior, I don't need trust—I can use monitoring instead. Trust becomes valuable precisely when monitoring is expensive or impossible.
For AI safety, this creates a dilemma:
- Technical opacity: Modern AI systems are hard to inspect; we cannot easily verify safety
- Racing dynamics: Monitoring slows development, creating competitive pressure to skip it
- Catastrophic stakes: Betrayal (unsafe AI) could be catastrophic
The combination of high verification costs and high betrayal costs leaves neither bare trust nor routine monitoring as a safe answer; it suggests we should reduce our reliance on trust, substituting technical and institutional mechanisms that don't require it.
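The underlying cost comparison can be written as a toy decision rule. The probabilities and costs below are placeholders, and the catastrophic case shows why the expected-cost term swamps almost any verification budget.

```python
def should_verify(p_betrayal: float,
                  betrayal_cost: float,
                  verification_cost: float) -> bool:
    """Verify when verification is cheaper than the expected cost of
    unmonitored betrayal; otherwise bare trust is the cheaper bet."""
    return verification_cost < p_betrayal * betrayal_cost

# Everyday stakes: a 5% chance of a 100-unit loss vs. a 20-unit audit.
print(should_verify(0.05, 100, 20))         # False: just trust
# Catastrophic stakes: a 0.1% chance of a billion-unit loss justifies
# a 100,000-unit verification effort many times over.
print(should_verify(0.001, 10**9, 10**5))   # True: verify
```

When opacity makes verification unavailable at any affordable price, the rule has no good branch left, which is precisely the argument for mechanisms that remove the need for trust rather than merely pricing it.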
3. The Availability of Alternatives
Trust is more justified when alternatives are worse. If I cannot trust the only available surgeon, I still need surgery—I may have no choice but to trust. This creates "forced trust" that is warranted only pragmatically.
For AI safety:
- Monopoly power: When a few companies control AI development, we must either trust them or forgo AI's benefits
- Regulatory capture: When industry captures regulators, "trusting" regulation is forced, not chosen
- Alternative governance: International cooperation, open source, distributed governance provide alternatives to trusting single actors
Distrust in AI Safety
Philosophical work on distrust is sparse but valuable. Distrust is not merely the absence of trust—it is a positive attitude involving:
- Withdrawal of vulnerability: Reducing reliance on the distrusted
- Negative normative expectations: Expecting them to act wrongly
- Protective action: Taking steps to prevent harm
Distrust can be warranted. Meena Krishnamurthy, drawing on Martin Luther King Jr., argues that distrust is the "confident belief that others will not act justly"—not necessarily from ill will, but from fear, ignorance, or institutional pressure.
For AI safety, warranted distrust might arise when:
- Track record of betrayal: Past safety failures or deception
- Misaligned incentives: Clear profit motives conflicting with safety
- Institutional corruption: Regulatory capture, revolving doors
- Opacity: Inability to verify claims
Warranted distrust is not cynicism—it is an appropriate response to evidence. Governance mechanisms should accommodate distrust, not demand its elimination.
Implications for AI Safety Governance
1. Design for Untrustworthiness
The safest assumption is that actors will sometimes be untrustworthy. Governance should work even when:
- Companies cut corners for competitive advantage
- Regulators are captured or incompetent
- Nations cheat on international agreements
This suggests mechanisms that don't require trust:
- Verification over trust: Technical mechanisms for proving safety properties
- Enforcement over voluntary compliance: Real consequences for violation
- Transparency by design: Systems that cannot hide their behavior
2. Build Trustworthiness, Not Just Trust
Trustworthiness is a property of the trusted; trust is an attitude of the trustor. Governance should focus on creating trustworthy actors, not merely cultivating trusting attitudes.
This means:
- Competence development: Ensure actors can actually build safe AI
- Commitment mechanisms: Create binding commitments, not just promises
- Motive alignment: Align incentives so trustworthy behavior is also self-interested
3. Use the Right Kind of Trust
Different situations call for different trust conceptions:
- Regulatory trust: Use commitment-based trust—companies don't need goodwill, they need binding commitments
- International trust: Use encapsulated interests—nations need self-interested reasons to honor agreements
- Public trust: May require goodwill or moral integrity—citizens want to believe AI developers care about human welfare
Conflating these leads to governance failures. Demanding goodwill from corporations may be unrealistic. Appealing only to encapsulated interests when addressing the public may breed legitimate distrust.
4. Make Betrayal Detectable and Costly
Trust requires the possibility of betrayal. But betrayal should be:
- Detectable: We should know when it happens (see the sketch after this list)
- Costly: There should be consequences
- Preventable: For catastrophic risks, we should design out the possibility of catastrophic betrayal
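Detectability can be engineered rather than hoped for. As a minimal sketch, suppose a developer publishes a hash-chained log of safety-relevant events: each entry commits to everything before it, so any later alteration or retraction breaks the chain and is itself evidence of betrayal. This illustrates the design principle only; the events and setup are hypothetical, not a description of any deployed system.

```python
import hashlib

def append_entry(chain: list[str], event: str) -> None:
    """Append a hash committing to the entire prior log plus this event,
    so silently rewriting history later becomes detectable."""
    prev = chain[-1] if chain else "genesis"
    chain.append(hashlib.sha256((prev + event).encode()).hexdigest())

def chain_matches(chain: list[str], claimed_events: list[str]) -> bool:
    """Recompute the chain from the claimed event history; any mismatch
    means the published log and the claimed history diverge."""
    expected: list[str] = []
    for event in claimed_events:
        append_entry(expected, event)
    return expected == chain

chain: list[str] = []
events = ["eval v1 passed", "incident disclosed"]
for e in events:
    append_entry(chain, e)
print(chain_matches(chain, events))                              # True
print(chain_matches(chain, ["eval v1 passed", "no incidents"]))  # False
```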
5. Accommodate Distrust
Warranted distrust is rational, not pathological. Governance mechanisms should:
- Welcome skepticism: Treat distrust as feedback, not obstruction
- Provide evidence: Allow the distrusting to verify claims
- Offer alternatives: Don't force trust on those with reasons to distrust
The Limits of Trust
Some problems are too important to trust. For catastrophic AI risks, we should not rely on:
- Trust in corporate goodwill: The stakes are too high
- Trust in regulatory competence: Regulators may lack technical capability
- Trust in international cooperation: Nations have strong incentives to defect
Where trust is insufficient, we need:
- Technical guarantees: Provably safe systems
- Distributed control: No single actor can cause catastrophe
- Fail-safe defaults: Systems that fail safely (see the sketch after this list)
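"Fail safely" names a concrete engineering posture rather than an aspiration: when the safety case cannot be affirmatively established, the default action is refusal or shutdown, not continued operation. A minimal deny-by-default sketch, with illustrative function names:

```python
def guarded_deploy(safety_check) -> str:
    """Deny-by-default: act only on an affirmative safety verdict.
    Errors, timeouts, and ambiguous results all resolve to 'halt'."""
    try:
        verdict = safety_check()
    except Exception:
        return "halt"  # a check that failed to run is not a check that passed
    return "deploy" if verdict is True else "halt"

def unreachable_evaluator():
    raise TimeoutError("evaluator did not respond")

print(guarded_deploy(lambda: True))           # deploy
print(guarded_deploy(lambda: None))           # halt: ambiguity is unsafe
print(guarded_deploy(unreachable_evaluator))  # halt: failure is unsafe
```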
Conclusion
Trust in AI safety governance is necessary but dangerous. We cannot verify everything, so we must trust. But misplaced trust can be catastrophic.
Philosophical analysis of trust reveals:
- Trust is more than reliance—it involves vulnerability to betrayal
- Trustworthiness requires both competence and willingness
- The motives of the trustworthy matter, but which motives matter is contested
- Trust in institutions differs from trust in persons
- Distrust can be warranted and should be accommodated
For AI safety governance, this suggests designing mechanisms that work with realistic levels of trustworthiness, accommodate warranted distrust, and don't require trust where the stakes are too high. The goal is not maximum trust but appropriate trust—trust that is warranted, in the right form, with the right fallbacks.
References
- Baier, Annette (1986). "Trust and Antitrust." Ethics 96(2).
- Hawley, Katherine (2014). "Trust, Distrust, and Commitment." Noûs 48(1).
- Hardin, Russell (2002). Trust and Trustworthiness. Russell Sage Foundation.
- Jones, Karen (2012). "Trustworthiness." Ethics 123(1).
- Krishnamurthy, Meena (2015). "(How) Can We Trust in Distrust?" In Trust, Democracy, and Multiculturalism.
- Stanford Encyclopedia of Philosophy (2023). "Trust." https://plato.stanford.edu/entries/trust/