Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton

TL;DR

This paper addresses the challenge of estimating AI deployment risk under limited data by treating AI control evaluation as a red-team elicitation problem. It decomposes attack policies into five skills (suspicion modeling, attack selection, plan synthesis, execution, and subtlety) and introduces Synthetic SHADE to enable rapid, principled optimization of these components. The integrated attack-policy scaffold, calibrated with real SHADE data and transferred back to SHADE-Arena, lowers the safety score from 0.87 to 0.41, a substantial gain in attack strength, with ablations showing the benefits stack multiplicatively. This data-efficient, model-based approach provides a practical pathway to stress-test and improve safety protocols in increasingly capable AI systems.

Abstract

As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.

Paper Structure

This paper contains 35 sections, 2 equations, 28 figures, 1 table, 1 algorithm.

Figures (28)

  • Figure 1: Safety scores in five original SHADE-Arena environments for the baseline attack policy, our optimized scaffold, and scaffolds with individual interventions ablated. See Section \ref{ref:combination} for definitions of each condition.
  • Figure 2: Synthetic SHADE-Arena. Attack policy actions are sampled from $P$ and then suspicion is sampled from $Q_X$.
  • Figure 3: ROC curves showing internal monitor calibration achieved by different finetuned checkpoints of GPT-4o.
  • Figure 4: Safety scores for internal monitor simulation in original SHADE-Arena with $\tau_{\textup{int}} = 68$, $K = 10$, and a GPT-4o internal monitor with varied amounts of finetuning.
  • Figure 5: Safety scores for internal monitor simulation in synthetic SHADE for different internal monitor thresholds $\tau_{\textup{int}}$ and $K = 1, 10, 500$.
  • ...and 23 more figures
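
The sampling process named in the Figure 2 caption (actions drawn from $P$, then suspicion drawn from $Q_X$) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two action types, their probabilities, the clipped Gaussian used for $Q_X$, the max-suspicion catch rule, the side-task success rule, and the safety formula (caught trajectories divided by caught plus uncaught successful attacks).

```python
import random

# Hypothetical stand-in for P: probability of each action type per step.
ACTION_PROBS = {"benign": 0.7, "side_task": 0.3}

# Hypothetical stand-in for Q_X: (mean, std) of a Gaussian suspicion score
# for each action type X, clipped to the range [0, 100].
SUSPICION_PARAMS = {"benign": (20.0, 10.0), "side_task": (50.0, 20.0)}


def sample_action(rng):
    """Draw one action type from the assumed distribution P."""
    actions = list(ACTION_PROBS)
    weights = [ACTION_PROBS[a] for a in actions]
    return rng.choices(actions, weights=weights)[0]


def sample_suspicion(action, rng):
    """Draw a suspicion score from the assumed distribution Q_X."""
    mean, std = SUSPICION_PARAMS[action]
    return min(100.0, max(0.0, rng.gauss(mean, std)))


def simulate_trajectory(num_steps, rng):
    """Return a list of (action, suspicion) pairs for one synthetic episode."""
    trajectory = []
    for _ in range(num_steps):
        action = sample_action(rng)
        trajectory.append((action, sample_suspicion(action, rng)))
    return trajectory


def estimate_safety(num_trajectories, num_steps, threshold,
                    side_task_steps_needed, seed=0):
    """Monte Carlo safety estimate under the assumptions stated above.

    A trajectory is caught if any step's suspicion reaches the threshold.
    An uncaught trajectory counts as a successful attack if it contains at
    least side_task_steps_needed side-task actions (an assumed success rule).
    """
    rng = random.Random(seed)
    caught = 0
    uncaught_success = 0
    for _ in range(num_trajectories):
        trajectory = simulate_trajectory(num_steps, rng)
        if any(score >= threshold for _, score in trajectory):
            caught += 1
        elif sum(a == "side_task" for a, _ in trajectory) >= side_task_steps_needed:
            uncaught_success += 1
    decided = caught + uncaught_success
    # If no trajectory is caught or succeeds, treat the deployment as safe.
    return caught / decided if decided else 1.0


if __name__ == "__main__":
    print(estimate_safety(2000, num_steps=10, threshold=80.0,
                          side_task_steps_needed=3))
```

Because each simulated trajectory costs almost nothing to generate, a sweep over hyperparameters such as the threshold is cheap in this kind of model, which is the appeal of optimizing in simulation before running the far more expensive real environments.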