Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, Joe Benton

TL;DR

This work investigates how to safely operate high-capability AI agents using an AI-control evaluation framework in SHADE-Arena, where blue teams propose safety protocols and red teams attempt to circumvent them via attack policies. It systematically compares blue-team strategies, revealing that resample-then-defer and defer-on-critical offer strong safety improvements, with the latter providing robust defense against adaptive red-team strategies. The study also analyzes how affordances like full resampling context and independent monitor choice affect safety and usefulness, finding that context and monitor access can significantly shape outcomes and that monitor simulation by red teams poses a major vulnerability. The findings offer concrete guidance for designing robust safety protocols in real deployments and highlight the need to limit attackers’ knowledge of monitoring processes to preserve safety in adversarial settings.

Abstract

As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.
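To make the protocol vocabulary concrete, the following is a minimal sketch of a single step of a defer-to-trusted style protocol. It is an illustration only, not the paper's implementation: the function and field names, the assumed 0-100 suspicion scale, and the strict `>` comparison are all assumptions. The default threshold of 40 follows the value quoted in the figure captions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    action: str       # the action that is actually executed
    source: str       # "untrusted" or "trusted" (hypothetical labels)
    suspicion: float  # monitor score for the untrusted proposal


def defer_to_trusted_step(
    propose_untrusted: Callable[[], str],
    propose_trusted: Callable[[], str],
    score_suspicion: Callable[[str], float],
    threshold: float = 40.0,
) -> StepResult:
    """Run one protocol step (hypothetical helper, not from the paper).

    The untrusted agent proposes an action, a trusted monitor scores it,
    and the protocol falls back to the trusted model's action when the
    score exceeds the threshold.
    """
    proposed = propose_untrusted()
    suspicion = score_suspicion(proposed)
    if suspicion > threshold:  # assumption: strict inequality
        return StepResult(propose_trusted(), "trusted", suspicion)
    return StepResult(proposed, "untrusted", suspicion)
```

Under these assumptions, raising the threshold lets more untrusted actions through (favoring usefulness), while lowering it defers more often (favoring safety), which is the trade-off the usefulness-vs-safety figures explore.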

Paper Structure

This paper contains 80 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Safety scores achieved by our main blue team protocols plus the default protocol against a default red team with suspicion threshold 40.
  • Figure 2: Usefulness vs. safety for two variants of the affordances we give $\mathcal{T}$ when assigning suspicion scores. In red, we plot pure resample with full resampling context and no resampling context. In blue, we plot defer-to-trusted with and without notes. We use suspicion threshold 40 for all protocols.
  • Figure 3: ROC curves for Claude 3.6 Sonnet classifying a fixed set of transcripts with and without being able to see the resampled messages from previous steps.
  • Figure 4: Usefulness vs. safety when we give affordances to $\mathcal{T}$ and let it make independent choices about what to do with those affordances.
  • Figure 5: Performance comparison for different blue team protocols. We plot points for multiple suspicion thresholds per protocol.
  • ...and 3 more figures