A sketch of an AI control safety case
Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving
TL;DR
The paper proposes a safety-case sketch for internally deployed LLM agents, grounded in AI-control principles, that bounds data-exfiltration risk through a structured control protocol and evidence from control evaluations. It introduces a five-phase control-evaluation workflow (control protocol specification, deployment proxy construction, model substitute construction, deployment proxy, and extrapolation) using red/blue teams and deployment proxies (microgames/macrogames) to produce conservative risk estimates. It details the safety-layer architecture, periodic evaluations, and communication constraints that are together intended to prevent unacceptable outcomes, while acknowledging limitations, potential attack patterns, and implementation costs. The work offers a blueprint for rigorous, evidence-based assurance in frontier AI deployments and invites broader dialogue among industry, policymakers, and civil society on safety-case development for varied deployment scenarios.
Abstract
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation", where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
