A sketch of an AI control safety case

Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving

TL;DR

The paper proposes a safety-case sketch for internally deployed LLM agents, grounded in AI-control principles, to bound data exfiltration risk through a structured control protocol and evidence from control evaluations. It introduces a five-phase control-evaluation workflow (control protocol specification, deployment proxy construction, model substitute construction, deployment proxy, and extrapolation) that uses red and blue teams and deployment proxies (microgames and macrogames) to produce conservative risk estimates. It details the safety-layer architecture, periodic evaluations, and communication constraints that together are meant to prevent unacceptable outcomes, while acknowledging limitations, potential attack patterns, and the costs of implementation. The work offers a blueprint for rigorous, evidence-based assurance in frontier AI deployments and invites broader dialogue among industry, policymakers, and civil society on safety-case development for varied deployment scenarios.

Abstract

As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.

Paper Structure

This paper contains 44 sections, 2 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: An illustration of the control evaluation that provides the core evidence used in our control safety case sketch. During the control evaluation, a red team constructs adversarial substitute models to cause unacceptable outcomes in a deployment proxy consisting of 2,000 AI research tasks, followed by extrapolation to a larger deployment setting.
  • Figure 2: The core claims of the AI control safety case sketch expressed in Claims, Arguments, Evidence (CAE) notation [bloomfield2014building]. The remaining sections of the CAE diagram are discussed throughout the body of the paper (in the figures on threat models, control measures, the red team, and extrapolation). For the full argument structure, see the appendix containing the full diagram.
  • Figure 3: The architecture of the LLM agent, adapted from [shlegeris2024systems]. Each box represents a separate physical machine or collection of machines; arrows represent HTTP requests.
  • Figure 4: This safety case sketch is written in the context of a hypothetical, more thorough safety case that would consider multiple threat models. This sketch focuses only on exfiltration of sensitive data (e.g. code) to the internet.
  • Figure 5: The safety layer we assume in this safety case sketch. The input to the policy is rejected if it is deemed out of distribution. If the input is valid, the safety layer first uses an untrusted policy to generate an action. The output of the policy is assigned a suspiciousness score by a monitor ensemble. The suspiciousness scores are compared with accept sets for 'execute,' 'edit,' or 'audit' to determine what should be done with each action.
  • ...and 14 more figures
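
The safety-layer flow described in the Figure 5 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `AcceptSets`, `safety_layer`, `in_distribution`, and the two thresholds are hypothetical, and modelling the accept sets as intervals over an averaged ensemble score is an assumption made only for illustration. The caption says only that scores are compared with accept sets for 'execute,' 'edit,' or 'audit.'

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class AcceptSets:
    """Hypothetical accept sets, modelled as intervals over one score."""
    execute_below: float  # score < execute_below            -> execute
    edit_below: float     # execute_below <= score < edit_below -> edit
                          # score >= edit_below               -> audit


def safety_layer(
    task_input: Any,
    in_distribution: Callable[[Any], bool],
    untrusted_policy: Callable[[Any], Any],
    monitors: Sequence[Callable[[Any], float]],
    accept_sets: AcceptSets,
) -> Tuple[str, Optional[Any]]:
    """Return a (decision, action) pair for a single input."""
    # Step 1: reject inputs deemed out of distribution.
    if not in_distribution(task_input):
        return ("reject", None)

    # Step 2: the untrusted policy proposes an action.
    action = untrusted_policy(task_input)

    # Step 3: the monitor ensemble scores the action. Averaging the
    # scores is an assumption; the source does not say how they combine.
    score = mean([monitor(action) for monitor in monitors])

    # Step 4: compare the score with the (assumed) accept sets.
    if score < accept_sets.execute_below:
        return ("execute", action)
    if score < accept_sets.edit_below:
        return ("edit", action)
    return ("audit", action)
```

In this sketch, an action with low suspiciousness runs unchanged, a moderately suspicious one is routed to editing, and the most suspicious ones are escalated to audit. The threshold values would have to come from the control evaluation itself and are not specified here.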