An Example Safety Case for Safeguards Against Misuse
Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies
TL;DR
The paper tackles the challenge of turning AI misuse safeguard evaluations into actionable risk decisions by presenting an end-to-end safety case. It defines a top-level claim [C0] that safeguards reduce misuse risk below a threshold $T$, then details threat modelling, concrete safeguards, and a quantitative uplift model. The model computes post-mitigation uplift as $U(A_{post}) = R(A_{post}) - R(A_{none})$, with $R(A_{none}) = a D \int_0^{\infty} p_{A_{none}}(t) f_T(t)\,dt$ and $R(A_{post}) = a D \int_0^{\infty} p_{A_{post}}(t) f_T(t)\,dt$. The methodology covers structured safeguard evaluations, aggregation of their results into extrapolated risk curves, and Monte Carlo simulations of deployment scenarios, including a one-month grace period for responses. This uplift-based approach gives a practical, continuous risk signal for deployment decisions, while acknowledging uncertainties and the need for ongoing threat reassessment. Overall, the work offers one concrete path to justifying that misuse risk from an AI system is low, using a formal safety-case framework.
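To make the uplift formula concrete, the following is a minimal numerical sketch, not the authors' implementation. The summary above does not define the symbols, so every interpretation here is an assumption made for illustration: $a$ is treated as a number of misuse attempts, $D$ as the damage per successful attempt, $f_T(t)$ as an exponential density over the effort $t$ an attacker is willing to spend, and $p_A(t)$ as a saturating success probability $1 - e^{-t/E}$, where $E$ is a hypothetical effort scale that is larger when no assistant is available. All numeric values are invented.

```python
import math


def trapezoid(f, lo, hi, n=20000):
    """Composite trapezoid rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h


def effort_density(t, rate=0.5):
    """Assumed f_T(t): exponential density over attacker effort t (arbitrary units)."""
    return rate * math.exp(-rate * t)


def success_prob(t, effort_scale):
    """Assumed p_A(t): chance of success after effort t; saturates as t grows."""
    return 1.0 - math.exp(-t / effort_scale)


def risk(effort_scale, a=10.0, D=1.0, t_max=60.0):
    """R(A) = a * D * integral of p_A(t) * f_T(t) dt, truncated at t_max.

    a, D, and t_max are illustrative values; the tail beyond t_max is
    negligible for the exponential density used here.
    """
    def integrand(t):
        return success_prob(t, effort_scale) * effort_density(t)

    return a * D * trapezoid(integrand, 0.0, t_max)


# Hypothetical effort scales: misuse is harder without any assistant.
r_none = risk(effort_scale=8.0)  # R(A_none): no AI assistant
r_post = risk(effort_scale=6.0)  # R(A_post): assistant with safeguards
uplift = r_post - r_none         # U(A_post) = R(A_post) - R(A_none)

print(f"R(A_none)={r_none:.3f}  R(A_post)={r_post:.3f}  U(A_post)={uplift:.3f}")
```

Under these assumptions the integral has the closed form $a D / (1 + rE)$ with $r = 0.5$, so the sketch should give roughly $R(A_{none}) = 2.0$, $R(A_{post}) = 2.5$, and an uplift of about $0.5$. A developer would compare an uplift-derived risk figure like this one against the threshold $T$; stronger safeguards raise the effort needed to evade them, pushing $R(A_{post})$ back toward $R(A_{none})$.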
Abstract
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying that AI misuse risks are low.
