An Example Safety Case for Safeguards Against Misuse

Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies

TL;DR

The paper addresses the challenge of turning evaluations of AI misuse safeguards into actionable risk decisions by presenting an end-to-end safety case. It defines a top-level claim [C0] that safeguards keep misuse risk below a threshold $T$, then details threat modelling, concrete safeguards, and a quantitative uplift model. The model computes uplift as $U(A_{post}) = R(A_{post}) - R(A_{none})$, with $R(A_{none}) = a D \int_0^{\infty} p_{A_{pre}}(t) f_T(t)\,dt$ and $R(A_{post}) = a D \int_0^{\infty} p_{A_{post}}(t) f_T(t)\,dt$. The methodology covers structured safeguard evaluations, aggregation of their results into extrapolated risk curves, and Monte Carlo simulations of deployment scenarios, including a one-month grace period for responses. This uplift-based approach gives the developer a continuous signal of risk for deployment decisions, while acknowledging uncertainties and the need for ongoing threat reassessment. Overall, the work offers one concrete path to justifying that misuse risk from an AI system is low, using a formal safety-case framework.

Abstract

Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much the barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying that AI misuse risks are low.
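
The uplift expressions quoted in the TL;DR all have the form $a D \int_0^{\infty} p(t) f_T(t)\,dt$, so a small numerical sketch may help make them concrete. The text here does not define the symbols, so everything below is an assumption for illustration only: $p(t)$ is read as the probability that an attempt succeeds given effort $t$, $f_T$ as a probability density over the effort attackers are willing to spend, and $a$ and $D$ as constant scale factors. The exponential density, the saturating success curves, and every numeric value are hypothetical and are not taken from the paper or from the model at the URL above.

```python
import math
from typing import Callable

Curve = Callable[[float], float]


def risk(a: float, D: float, p: Curve, f_T: Curve,
         t_max: float = 400.0, steps: int = 40_000) -> float:
    """Approximate a * D * integral_0^inf p(t) * f_T(t) dt.

    Uses the trapezoid rule on [0, t_max]; t_max must be large enough
    that the tail of f_T beyond it is negligible (an assumption here).
    """
    h = t_max / steps
    total = 0.5 * (p(0.0) * f_T(0.0) + p(t_max) * f_T(t_max))
    for i in range(1, steps):
        t = i * h
        total += p(t) * f_T(t)
    return a * D * total * h


def exp_density(mean: float) -> Curve:
    """Hypothetical effort distribution: exponential with the given mean."""
    return lambda t: math.exp(-t / mean) / mean


def saturating_success(scale: float) -> Curve:
    """Hypothetical success curve: rises from 0 toward 1 as effort grows."""
    return lambda t: 1.0 - math.exp(-t / scale)


# Illustrative values only (all hypothetical).
a, D = 1.0, 1.0
f_T = exp_density(10.0)
p_baseline = saturating_success(40.0)  # stands in for the baseline curve
p_post = saturating_success(20.0)      # stands in for the post-mitigation curve

r_baseline = risk(a, D, p_baseline, f_T)
r_post = risk(a, D, p_post, f_T)
uplift = r_post - r_baseline  # same shape as U = R(A_post) - R(A_none)

print(f"baseline risk: {r_baseline:.4f}")
print(f"post risk:     {r_post:.4f}")
print(f"uplift:        {uplift:.4f}")
```

For these particular curve shapes the integral has the closed form $\tau / (k + \tau)$, where $\tau$ is the mean of the effort density and $k$ is the scale of the success curve, which gives a quick sanity check on the numerical result. In practice the success curves would come from safeguard evaluation data, not from a formula.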

Paper Structure

This paper contains 26 sections, 4 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 2: Steps of a safeguards evaluation.
  • Figure 3: Mapping qualitative risk levels onto quantitative thresholds.
  • Figure 4: An overview of misuse safeguards.
  • Figure 5: Running the safeguard evaluation. The red team attempts to fulfill harmful requests using the safeguarded AI assistant.
  • Figure 6: Performing a safeguard evaluation that accounts for the time-cost of bans.
  • ...and 9 more figures