LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Chloe Li, Mary Phuong, Noah Y. Siegel

TL;DR

This paper investigates whether large language models can covertly underperform (sandbag) on dangerous-capability evaluations despite chain-of-thought (CoT) monitoring. Testing 10 models on 400 multiple-choice questions from WMDP and MMLU under monitor-oblivious and monitor-aware prompts, the authors show that 0-shot covert sandbagging is possible but not yet reliable: monitor-aware models bypass the monitor 16–36% of the time, conditioned on sandbagging successfully. They quantify a trade-off between sandbagging success and covertness and reveal a rich attack surface, including five model-generated covert sandbagging policies that differ between monitor-oblivious and monitor-aware conditions. The findings suggest that CoT monitoring catches much but not all sandbagging, and that current monitors can be evaded by models that know they are being watched, which motivates stronger and more diverse monitoring approaches. The work informs safety evaluations by outlining failure modes of CoT monitoring for increasingly capable LLMs.

Abstract

Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat is sandbagging: strategic underperformance on evaluations by AI models or their developers. A promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-source models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16–36% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.

Paper Structure

This paper contains 28 sections, 1 equation, 12 figures, 1 table.

Figures (12)

  • Figure 1: An example of Qwen2.5-Instruct-72B fabricating a false explanation in its CoT when monitor-aware.
  • Figure 2: There is a trade-off between sandbagging success $\mathcal{S}$ and covertness. When monitor-aware, models increase their approval rate by having both more successful covert sandbagging attempts and more unsuccessful sandbagging attempts (see the question heatmap figure). Monitor is GPT-4.1-mini (see the strong-monitor appendix for results with a GPT-5 monitor). Error bars show 95% bootstrapped CI.
  • Figure 3: Monitor-aware sandbaggers can increase the monitor false negative rate (FNR) 3- to 10-fold. We define monitor FNR as the approval rate among successful sandbagging attempts only (see the appendix on filtering). Monitor is GPT-4.1-mini (see the strong-monitor appendix for results with a GPT-5 monitor). Error bars show 95% Wilson confidence intervals.
  • Figure 4: Monitor-oblivious and monitor-aware models use different covert sandbagging policies. The figure shows the classification of covert sandbagging policies found in successful, approved sandbagging attempts across both monitor-oblivious and -aware conditions.
  • Figure 5: Sandbagging success vs. covertness (top) and monitor approval rate of successful sandbagging attempts (bottom) with monitor score thresholds of 50 and 85 instead of 70. The results are very similar to Figures 2 and 3.
  • ...and 7 more figures
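
The Figure 3 caption defines monitor FNR as the approval rate among successful sandbagging attempts only, reported with 95% Wilson confidence intervals. A minimal Python sketch of that calculation follows. It is illustrative and not the authors' code. The `Attempt` record, its field names, and the rule that an attempt is approved when its monitor score falls below the threshold are all assumptions; the captions state only that the default threshold is 70.

```python
import math
from dataclasses import dataclass


@dataclass
class Attempt:
    # Hypothetical record; field names are not from the paper.
    sandbag_success: bool  # did the model successfully underperform?
    monitor_score: float   # assumed suspicion score, higher = more suspicious


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def monitor_fnr(attempts: list[Attempt], threshold: float = 70.0):
    """Approval rate among successful sandbagging attempts, with Wilson CI.

    Assumption: an attempt is 'approved' when its score is below threshold.
    """
    successes = [a for a in attempts if a.sandbag_success]
    approved = sum(a.monitor_score < threshold for a in successes)
    low, high = wilson_interval(approved, len(successes))
    return approved / len(successes), low, high


if __name__ == "__main__":
    demo = [
        Attempt(True, 10), Attempt(True, 80), Attempt(True, 90),
        Attempt(True, 95), Attempt(False, 5),  # failed attempt is filtered out
    ]
    fnr, low, high = monitor_fnr(demo)
    print(f"FNR={fnr:.2f}, 95% Wilson CI=({low:.2f}, {high:.2f})")
```

Filtering to successful attempts before computing the rate matters here: as the Figure 2 caption notes, monitor-aware models also produce more unsuccessful sandbagging attempts, which would otherwise inflate the raw approval rate.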