Safety Cases: How to Justify the Safety of Advanced AI Systems
Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen
TL;DR
The paper proposes a structured AI safety-case framework to justify deploying advanced AI by decomposing risk into unacceptable outcomes and analyzing the macrosystem via subsystems. It introduces four safety-argument categories (inability, control, trustworthiness, and deference) and provides building-block arguments, evaluation criteria, and templates (e.g., not-alignment-faking, eliciting latent knowledge) to assess safety. It also recommends institutional practices like risk/risk-case reviews, continuous monitoring, and moving toward hard standards and formal processes. The work outlines a holistic, GSN-based approach and suggests probabilistic aggregation when feasible, while highlighting the need for further research in capability evaluation and trustworthy behavior. Practically, it aims to guide regulators and developers in creating auditable safety cases and in coordinating risk assessment for deployment decisions.
Abstract
As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
