Trust is fundamental to AI safety governance, yet poorly understood. We trust AI developers to build safe systems, regulators to enforce rules, and other nations to honor agreements. But philosophical analysis reveals that trust is more than reliance: it involves vulnerability to betrayal, carries normative expectations, and often requires particular motivations in the trusted party. Understanding trust's complexity is essential for designing governance mechanisms that can function when trust is warranted and survive when it is not.
The Trust Problem in AI Safety
AI safety governance depends on trust at multiple levels:
- Public trust: Citizens trusting that AI systems are safe and beneficial
- Regulatory trust: Governments trusting companies to comply with rules
- Self-regulatory trust: Industry trusting voluntary frameworks
- International trust: Nations trusting each other's commitments
- Technical trust: Trusting that safety evaluations detect problems
Yet trust in AI actors is often misplaced, withdrawn too quickly, or demanded inappropriately. Tech companies ask for trust while concealing their methods. Regulators ask to be trusted while lacking enforcement capacity. Nations pledge cooperation while quietly racing. The governance challenge is not just building trust, but understanding when trust is warranted and when other mechanisms should replace it.
What Is Trust? Philosophical Foundations
Philosophical analysis reveals trust to be more complex than commonly assumed. Several key distinctions matter for AI safety governance.
Trust vs. Reliance
Trust is not merely reliance. As Annette Baier notes, trusting can be "betrayed, or at least let down, and not just disappointed." When my alarm clock fails, I am disappointed but not betrayed—alarm clocks cannot betray. But when a colleague fails to deliver promised work, I may feel betrayed.
This distinction matters because:
- Trust creates moral obligations: We expect those we trust to recognize their responsibility
- Trust enables monitoring reduction: We can suspend some oversight when we truly trust
- Trust is a relationship: It exists between persons (or person-like entities), not merely with objects
For AI governance, this suggests that "trust" in corporations or institutions is meaningful only if betrayal is possible—and that systems designed to prevent all betrayal may actually prevent trust.
The Competence and Willingness Conditions
Trustworthiness requires both competence and willingness. I cannot sensibly trust an incompetent surgeon, no matter their goodwill; nor a competent surgeon who lacks the motivation to help me.
For AI safety:
- Competence: Can this actor actually build safe AI? Do regulators have the technical expertise to evaluate systems?
- Willingness: Does this actor want to build safe AI? Will companies prioritize safety over speed?
Many governance debates conflate these. Companies emphasize their willingness ("we care about safety") while downplaying competence questions ("can we actually guarantee safety?"). Critics emphasize unwillingness ("they'll cut corners for profit") while sometimes ignoring genuine uncertainty about what safe AI requires.
The Motive Question
Philosophers debate whether trustworthy action must spring from particular motives:
Encapsulated interests (Hardin): People are trustworthy when they have self-interested reasons to act as trusted—when their interests "encapsulate" the trustor's interests. A company is trustworthy if maintaining the relationship matters more than betrayal.
Goodwill (Baier): People are trustworthy when they act from genuine care for the trustor or what they're entrusted with. Motive matters—a company treating users well only to extract more data is not truly trustworthy.
Moral integrity: People are trustworthy when committed to moral values regardless of relationship. A stranger is trustworthy if committed to decency.
Commitment (Hawley): People are trustworthy when they have a commitment to doing what they're trusted to do, regardless of motive. What matters is the commitment, not why it exists.
For AI safety, these theories have different implications:
- Encapsulated interests: Focus on making safety align with corporate self-interest (liability, reputation, regulatory capture prevention)
- Goodwill: Demand evidence of genuine concern for human welfare, not just profit motives
- Moral integrity: Look for actors with genuine commitments to safety values
- Commitment: Establish clear, public commitments that create trustworthiness regardless of underlying motives
Different governance mechanisms may be needed depending on which conception we adopt. If goodwill is essential, we need ways to assess motives. If commitment suffices, we need mechanisms for creating and verifying commitments.
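To see what the encapsulated-interests view asks of governance, consider a minimal repeated-game sketch: a self-interested actor honors trust only while the discounted value of the ongoing relationship exceeds the one-shot gain from betrayal. The numbers below are illustrative assumptions, not estimates from the trust literature.

```python
def betrayal_is_profitable(one_shot_gain: float,
                           per_round_value: float,
                           discount: float) -> bool:
    """In an indefinitely repeated interaction where betrayal ends the
    relationship, betraying pays only if the immediate gain exceeds the
    discounted stream of future rounds: gain > d * v / (1 - d)."""
    future_value = discount * per_round_value / (1.0 - discount)
    return one_shot_gain > future_value

# Illustrative assumption: a lab gains 10 (arbitrary units) by cutting
# a safety corner once, versus 1 per round from the ongoing relationship.
for discount in (0.5, 0.95):
    print(discount, betrayal_is_profitable(10.0, 1.0, discount))
# At discount 0.5 the future is worth only 1.0, so betrayal pays;
# at 0.95 it is worth 19.0, so honoring trust pays.
```

On this view, the real work of governance is raising the value of the continuing relationship (licensing, procurement, reputation) and the actor's patience, not cultivating goodwill.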
Trust in Institutions vs. Persons
Can we trust corporations, governments, or international bodies in the same way we trust individuals? Philosophical analysis suggests caution.
Institutions lack feelings, cannot have goodwill in the personal sense, and their "motives" are aggregations of individual interests. When we "trust" a corporation, we might mean:
- We trust their institutional design to produce certain behaviors
- We trust key individuals within the institution
- We rely on them without full-blown trust
This matters for AI safety governance because:
- Corporate trust is fragile: Leadership changes, incentives shift, "trust" evaporates
- Institutional design > personal trust: Better to design systems that work despite untrustworthy actors
- Mixed trust landscapes: We may trust some individuals within untrustworthy institutions
When Is Trust Warranted? The Epistemology of Trust
The epistemology of trust asks: when is trust justified? Several factors matter:
1. Evidence of Trustworthiness
- Track record: Has this actor earned trust through past behavior?
- Transparency: Can we observe their methods and motives?
- Accountability: Are there consequences for betrayal?
For AI companies, transparency reports, safety incident disclosures, and independent audits provide such evidence. Companies demanding "trust" without providing evidence are asking for something other than warranted trust.
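One way to make the track-record criterion precise is to treat each observed commitment as evidence and update a belief about trustworthiness. The beta-Bernoulli sketch below is a deliberately crude illustration, assuming commitments are observable and comparable; the prior and the counts are invented for the example.

```python
def trust_posterior(prior_a: float, prior_b: float,
                    kept: int, broken: int) -> float:
    """Beta-Bernoulli update: posterior mean probability that the actor
    honors a commitment, after observing `kept` honored and `broken`
    violated commitments."""
    a = prior_a + kept
    b = prior_b + broken
    return a / (a + b)

# Skeptical prior Beta(1, 3): absent any track record, expect ~25%.
print(trust_posterior(1, 3, kept=0, broken=0))   # 0.25
print(trust_posterior(1, 3, kept=18, broken=2))  # ~0.79
```

The epistemic point survives the model's crudeness: evidence can move the estimate only if behavior is observable, which is why transparency and independent audits matter.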
2. The Cost of Betrayal vs. Verification
Trust is warranted when the cost of verification exceeds the expected cost of betrayal. If I can easily verify your behavior, I don't need trust—I can use monitoring instead. Trust becomes valuable precisely when monitoring is expensive or impossible.
For AI safety, this creates a dilemma:
- Technical opacity: Modern AI systems are hard to inspect; we cannot easily verify safety
- Racing dynamics: Monitoring slows development, creating competitive pressure to skip it
- Catastrophic stakes: Betrayal (unsafe AI) could be catastrophic
The combination of high verification costs and high betrayal costs leaves neither bare trust nor routine monitoring as a safe answer; it suggests we should reduce our reliance on trust, substituting technical and institutional mechanisms that don't require it.
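The underlying cost comparison can be written as a toy decision rule. The probabilities and costs below are placeholders, and the catastrophic case shows why the expected-cost term swamps almost any verification budget.

```python
def should_verify(p_betrayal: float,
                  betrayal_cost: float,
                  verification_cost: float) -> bool:
    """Verify when verification is cheaper than the expected cost of
    unmonitored betrayal; otherwise bare trust is the cheaper bet."""
    return verification_cost < p_betrayal * betrayal_cost

# Everyday stakes: a 5% chance of a 100-unit loss vs. a 20-unit audit.
print(should_verify(0.05, 100, 20))         # False: just trust
# Catastrophic stakes: a 0.1% chance of a billion-unit loss justifies
# a 100,000-unit verification effort many times over.
print(should_verify(0.001, 10**9, 10**5))   # True: verify
```

When opacity makes verification unavailable at any affordable price, the rule has no good branch left, which is precisely the argument for mechanisms that remove the need for trust rather than merely pricing it.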
3. The Availability of Alternatives
Trust is more justified when alternatives are worse. If I cannot trust the only available surgeon, I still need surgery—I may have no choice but to trust. This creates "forced trust" that is warranted only pragmatically.
For AI safety:
- Monopoly power: When a few companies control AI development, we must either trust them or forgo AI's benefits
- Regulatory capture: When industry captures regulators, "trusting" regulation is forced, not chosen
- Alternative governance: International cooperation, open source, distributed governance provide alternatives to trusting single actors
Distrust in AI Safety
Philosophical work on distrust is sparse but valuable. Distrust is not merely the absence of trust—it is a positive attitude involving:
- Withdrawal of vulnerability: Reducing reliance on the distrusted
- Negative normative expectations: Expecting them to act wrongly
- Protective action: Taking steps to prevent harm
Distrust can be warranted. Meena Krishnamurthy, drawing on Martin Luther King Jr., argues that distrust is the "confident belief that others will not act justly"—not necessarily from ill will, but from fear, ignorance, or institutional pressure.
For AI safety, warranted distrust might arise when:
- Track record of betrayal: Past safety failures or deception
- Misaligned incentives: Clear profit motives conflicting with safety
- Institutional corruption: Regulatory capture, revolving doors
- Opacity: Inability to verify claims
Warranted distrust is not cynicism—it is an appropriate response to evidence. Governance mechanisms should accommodate distrust, not demand its elimination.
Implications for AI Safety Governance
1. Design for Untrustworthiness
The safest assumption is that actors will sometimes be untrustworthy. Governance should work even when:
- Companies cut corners for competitive advantage
- Regulators are captured or incompetent
- Nations cheat on international agreements
This suggests mechanisms that don't require trust:
- Verification over trust: Technical mechanisms for proving safety properties
- Enforcement over voluntary compliance: Real consequences for violation
- Transparency by design: Systems that cannot hide their behavior
2. Build Trustworthiness, Not Just Trust
Trustworthiness is a property of the trusted; trust is an attitude of the trustor. Governance should focus on creating trustworthy actors, not merely cultivating trusting attitudes.
This means:
- Competence development: Ensure actors can actually build safe AI
- Commitment mechanisms: Create binding commitments, not just promises
- Motive alignment: Align incentives so trustworthy behavior is also self-interested
3. Use the Right Kind of Trust
Different situations call for different trust conceptions:
- Regulatory trust: Use commitment-based trust—companies don't need goodwill, they need binding commitments
- International trust: Use encapsulated interests—nations need self-interested reasons to honor agreements
- Public trust: May require goodwill or moral integrity—citizens want to believe AI developers care about human welfare
Conflating these leads to governance failures. Demanding goodwill from corporations may be unrealistic. Appealing only to encapsulated interests when addressing the public may breed legitimate distrust.
4. Make Betrayal Detectable and Costly
Trust requires the possibility of betrayal. But betrayal should be:
- Detectable: We should know when it happens (see the sketch after this list)
- Costly: There should be consequences
- Preventable: For catastrophic risks, we should design out the possibility of catastrophic betrayal
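Detectability can be engineered rather than hoped for. As a minimal sketch, suppose a developer publishes a hash-chained log of safety-relevant events: each entry commits to everything before it, so any later alteration or retraction breaks the chain and is itself evidence of betrayal. This illustrates the design principle only; the events and setup are hypothetical, not a description of any deployed system.

```python
import hashlib

def append_entry(chain: list[str], event: str) -> None:
    """Append a hash committing to the entire prior log plus this event,
    so silently rewriting history later becomes detectable."""
    prev = chain[-1] if chain else "genesis"
    chain.append(hashlib.sha256((prev + event).encode()).hexdigest())

def chain_matches(chain: list[str], claimed_events: list[str]) -> bool:
    """Recompute the chain from the claimed event history; any mismatch
    means the published log and the claimed history diverge."""
    expected: list[str] = []
    for event in claimed_events:
        append_entry(expected, event)
    return expected == chain

chain: list[str] = []
events = ["eval v1 passed", "incident disclosed"]
for e in events:
    append_entry(chain, e)
print(chain_matches(chain, events))                              # True
print(chain_matches(chain, ["eval v1 passed", "no incidents"]))  # False
```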
5. Accommodate Distrust
Warranted distrust is rational, not pathological. Governance mechanisms should:
- Welcome skepticism: Treat distrust as feedback, not obstruction
- Provide evidence: Allow the distrusting to verify claims
- Offer alternatives: Don't force trust on those with reasons to distrust
The Limits of Trust
Some problems are too important to trust. For catastrophic AI risks, we should not rely on:
- Trust in corporate goodwill: The stakes are too high
- Trust in regulatory competence: Regulators may lack technical capability
- Trust in international cooperation: Nations have strong incentives to defect
Where trust is insufficient, we need:
- Technical guarantees: Provably safe systems
- Distributed control: No single actor can cause catastrophe
- Fail-safe defaults: Systems that fail safely (see the sketch after this list)
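"Fail safely" names a concrete engineering posture rather than an aspiration: when the safety case cannot be affirmatively established, the default action is refusal or shutdown, not continued operation. A minimal deny-by-default sketch, with illustrative function names:

```python
def guarded_deploy(safety_check) -> str:
    """Deny-by-default: act only on an affirmative safety verdict.
    Errors, timeouts, and ambiguous results all resolve to 'halt'."""
    try:
        verdict = safety_check()
    except Exception:
        return "halt"  # a check that failed to run is not a check that passed
    return "deploy" if verdict is True else "halt"

def unreachable_evaluator():
    raise TimeoutError("evaluator did not respond")

print(guarded_deploy(lambda: True))           # deploy
print(guarded_deploy(lambda: None))           # halt: ambiguity is unsafe
print(guarded_deploy(unreachable_evaluator))  # halt: failure is unsafe
```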
Conclusion
Trust in AI safety governance is necessary but dangerous. We cannot verify everything, so we must trust. But misplaced trust can be catastrophic.
Philosophical analysis of trust reveals:
- Trust is more than reliance—it involves vulnerability to betrayal
- Trustworthiness requires both competence and willingness
- The motives of the trustworthy matter, but which motives matter is contested
- Trust in institutions differs from trust in persons
- Distrust can be warranted and should be accommodated
For AI safety governance, this suggests designing mechanisms that work with realistic levels of trustworthiness, accommodate warranted distrust, and don't require trust where the stakes are too high. The goal is not maximum trust but appropriate trust—trust that is warranted, in the right form, with the right fallbacks.
References
- Baier, Annette (1986). "Trust and Antitrust." Ethics 96(2).
- Hawley, Katherine (2014). "Trust, Distrust, and Commitment." Noûs 48(1).
- Hardin, Russell (2002). Trust and Trustworthiness. Russell Sage Foundation.
- Jones, Karen (2012). "Trustworthiness." Ethics 123(1).
- Krishnamurthy, Meena (2015). "(How) Can We Trust in Distrust?" In Trust, Democracy, and Multiculturalism.
- Stanford Encyclopedia of Philosophy (2023). "Trust." https://plato.stanford.edu/entries/trust/