Safety Cases: How to Justify the Safety of Advanced AI Systems

Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen

TL;DR

The paper proposes a structured safety-case framework for justifying the deployment of advanced AI: it decomposes risk into unacceptable outcomes and analyzes the macrosystem through its subsystems. It introduces four categories of safety argument (inability, control, trustworthiness, and deference) and provides building-block arguments, evaluation criteria, and templates (e.g., not-alignment-faking, eliciting latent knowledge) for assessing safety. It also recommends institutional practices such as risk-case reviews, continuous monitoring, and a move toward hard standards and formal processes. The work outlines a holistic approach based on Goal Structuring Notation (GSN) and suggests probabilistic aggregation when feasible, while highlighting the need for further research on capability evaluation and trustworthy behavior. Practically, it aims to guide regulators and developers in creating auditable safety cases and in coordinating risk assessment for deployment decisions.

Abstract

As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.

Paper Structure (54 sections, 17 figures, 1 table)

Figures (17)

  • Figure 1: The GSN diagram is the start of a holistic safety case (section \ref{sec:holistic}). The decomposition above would occur in step 2 (specifying unacceptable outcomes). Rectangles labeled ‘G’ represent subclaims (i.e., goals). Parallelograms labeled ‘S’ indicate justification strategies. For an example of a full safety case, see section \ref{sec:holistic}.
  • Figure 2: As AI systems become more powerful, developers will likely increasingly rely on arguments toward the right in the plot above.
  • Figure 3: Building block arguments for making safety cases. See section \ref{sec:arguments} for detailed descriptions of each building block argument and justifications for their labels. The arguments that rank highest on practicality, strength, and scalability are monitoring, externalized reasoning, and testbeds.
  • Figure 4: The diagram above is a ‘risk matrix’ used in the aviation industry \cite{riskmatrix}. If boxes in red are checked, the risk is considered unacceptably high.
  • Figure 5: We recommend that regulators review risk cases along with safety cases, effectively putting AI systems “on trial.”
  • ...and 12 more figures