How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving

TL;DR

The paper addresses how to evaluate control measures for LLM agents as capabilities grow. It proposes a framework that ties red-team affordances to capability profiles and defines AI control levels (ACLs) across a trajectory of five fictional models (M1–M5). It describes the framework's components (capability profiles, deployment context, threat models, rules of control evaluations, control measures, and safety cases) and offers example rules, control measures, and safety-case structures for each ACL. A central message is that capability-aware control yields practical and cost-effective safety in the near term, but scaling control evaluations to superintelligent agents (ACL-5) faces fundamental challenges that may require breakthroughs in capability elicitation or new paradigms such as scalable oversight, as safety-case evidence becomes increasingly hard to ground. The paper presents a concrete, iterative roadmap: many control components are already deployed as minimum viable products and can be scaled incrementally as capabilities progress. It also acknowledges the limits of control evaluations at the highest levels and the need for ongoing governance and methodological innovation.

Abstract

As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper, we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.

Paper Structure

This paper contains 25 sections, 14 figures.

Figures (14)

  • Figure 1: We sketch a trajectory of how control evaluations might evolve through capability levels. The trajectory is based on a sequence of five fictional models (M1-M5) with progressively advanced capabilities. For each model, we define an AI Control Level (ACL) based primarily on its threat model-specific capabilities. By mapping these capabilities to appropriate control evaluation methodologies, we provide a systematic approach for adapting safety measures to AI systems as they advance in capabilities.
  • Figure 2: Traditionally, control evaluations have been designed to require the red team to upper bound models in terms of capabilities (red dashed line). We argue that this bound should be tight (violet dashed curve).
  • Figure 3: The components of the framework presented in our paper. A model of a particular capability profile in a given deployment context gives rise to particular threat models. Threat-model specific capabilities of that model define an AI Control Level (ACL). Adequate control evaluations at a given ACL define adequate control measures (control measures are adequate if they pass this control evaluation) and produce evidence feeding into a safety case.
  • Figure 4: The general structure of a safety case for an AI agent. In the rest of the paper, we will focus on partial safety cases for a specific notion of bad outcome, abstracting away from threat modeling and risk thresholds. For discussion on how these fit a full safety case, see Korbak et al. (2025). The ellipsis represents a subtree that varies across different ACLs; see the ACL-specific safety case figures (fig:sc1 through fig:sc4).
  • Figure 5: Defining capabilities of ACL-1, example rules of control evaluations they imply, and an example of control measures that pass such control evaluation.
  • ...and 9 more figures