Safety Cases: How to Justify the Safety of Advanced AI Systems

Joshua Clymer; Nick Gabrieli; David Krueger; Thomas Larsen

TL;DR

The paper proposes a structured safety-case framework for justifying the deployment of advanced AI systems: developers decompose risk into unacceptable outcomes and analyze the AI macrosystem in terms of its subsystems. It introduces four categories of safety arguments (inability, control, trustworthiness, and deference) and provides building-block arguments, evaluation criteria, and argument templates (e.g., not-alignment-faking, eliciting latent knowledge) for assessing safety. It also recommends institutional practices such as reviewing risk cases alongside safety cases, continuous monitoring, and moving toward hard standards and formal processes. The work outlines a holistic approach based on Goal Structuring Notation (GSN) and suggests probabilistic aggregation when feasible, while highlighting the need for further research on capability evaluation and trustworthy behavior. In practice, it aims to help regulators and developers create auditable safety cases and coordinate risk assessment for deployment decisions.

Abstract

As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
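
As a rough illustration of how such a structured rationale might be represented, the sketch below models a safety case as a tree of GSN-style goals whose leaves are tagged with one of the four argument categories. The class names, the `evidence_accepted` flag, the placeholder claims, and the rule that a goal holds only if all of its subgoals hold are assumptions made for this example; they are not taken from the paper, which also suggests probabilistic aggregation when feasible.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ArgumentCategory(Enum):
    """The four argument categories named in the abstract."""
    INABILITY = "inability"
    CONTROL = "control"
    TRUSTWORTHINESS = "trustworthiness"
    DEFERENCE = "deference"


@dataclass
class Goal:
    """A GSN-style goal (subclaim). All field names are hypothetical."""
    claim: str
    category: Optional[ArgumentCategory] = None  # only meaningful on leaves
    evidence_accepted: bool = False              # leaf: reviewers accept the argument
    subgoals: List["Goal"] = field(default_factory=list)

    def is_justified(self) -> bool:
        # Assumption: a goal is justified only if every subgoal is justified.
        if not self.subgoals:
            return self.evidence_accepted
        return all(g.is_justified() for g in self.subgoals)

    def open_claims(self) -> List[str]:
        # Collect leaf claims that still lack accepted evidence.
        if not self.subgoals:
            return [] if self.evidence_accepted else [self.claim]
        return [c for g in self.subgoals for c in g.open_claims()]


# Placeholder claims for illustration only.
top = Goal(
    claim="The AI macrosystem is unlikely to cause a catastrophe",
    subgoals=[
        Goal("Placeholder outcome A is avoided",
             ArgumentCategory.INABILITY, evidence_accepted=True),
        Goal("Placeholder outcome B is avoided",
             ArgumentCategory.CONTROL, evidence_accepted=False),
    ],
)

print(top.is_justified())  # False
print(top.open_claims())   # ['Placeholder outcome B is avoided']
```

A tree like this makes it easy to list which subclaims still need support before the top-level claim can be accepted.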

Paper Structure

This paper contains 54 sections, 17 figures, 1 table.

Introduction
Executive Summary
A framework for structuring an AI safety case
Safety argument categories
Safety argument examples
Defining Terms
Recommendations for Institutions Using Safety Cases
Safety arguments: building blocks for safety cases
Desiderata for safety arguments
Inability Arguments
Dangerous Capability Evaluations
Control Arguments
Control argument examples
Trustworthiness Arguments
Black Swans
...and 39 more sections

Figures (17)

Figure 1: The GSN diagram above is the start of a holistic safety case (section \ref{sec:holistic}). The decomposition shown would occur in step 2 (specifying unacceptable outcomes). Rectangles labeled ‘G’ represent subclaims (i.e., goals). Parallelograms labeled ‘S’ indicate justification strategies. For an example of a full safety case, see section \ref{sec:holistic}.
Figure 2: As AI systems become more powerful, developers will likely increasingly rely on arguments toward the right in the plot above.
Figure 3: Building-block arguments for making safety cases. See section \ref{sec:arguments} for detailed descriptions of each building-block argument and justifications for their labels. The arguments that rank highest on practicality, strength, and scalability are monitoring, externalized reasoning, and testbeds.
Figure 4: The diagram above is a ‘risk matrix’ used in the aviation industry \cite{riskmatrix}. If boxes in red are checked, risk is considered unacceptably high.
Figure 5: We recommend that regulators review risk cases along with safety cases, effectively putting AI systems “on trial.”
...and 12 more figures
