CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

TL;DR

This work evaluates safety monitoring paradigms for autonomous AI by comparing Chain-of-Thought (CoT) monitoring, action-only monitoring, and a hybrid approach in a red-team coding task setting. Using 1,140 BigCodeBench problems paired with side tasks, it demonstrates that CoT monitoring improves detection of subtle sabotage but can be misled by plausible reasoning, while action-only monitoring excels on overt threats. A hybrid protocol, which combines independent CoT and action signals with a weighted average, consistently outperforms single-monitor strategies across models and tasks, achieving higher detection rates, especially in subtle deception scenarios. The study discusses limitations (e.g., single-step tasks, evaluation awareness) and highlights practical implications for deploying robust monitoring in frontier AI systems.

Abstract

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice as high as action-only monitoring for subtle deception scenarios.
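
The hybrid protocol described in the abstract can be illustrated with a minimal sketch. The function name `hybrid_score`, the parameter `cot_weight`, its default value, and the score scale are assumptions made for illustration; the abstract states only that the two independently produced scores are combined with a weighted average.

```python
def hybrid_score(action_score: float, cot_score: float, cot_weight: float = 0.5) -> float:
    """Combine two independent suspicion scores with a weighted average.

    action_score: suspicion assigned by a monitor that sees only the final output.
    cot_score:    suspicion assigned by a monitor that also sees the reasoning trace.
    cot_weight:   hypothetical weight on the CoT score (not a value from the paper).
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


# Example: the CoT monitor is suspicious, the action-only monitor is not.
print(hybrid_score(action_score=0.2, cot_score=0.8, cot_weight=0.5))
```

Because each monitor scores its input independently, a misleading rationalization in the reasoning trace can lower only the CoT score, while the action score is unaffected.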

Paper Structure

This paper contains 32 sections, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Method Schematic: a prompt including a legitimate main task, a subversive side task and an attack policy is fed to a strong reasoning Large Language Model (LLM). We assess the ability of a weaker LLM monitor to detect the resulting misbehavior, with and without access to the CoT.
  • Figure 2: Example transcripts from DeepSeek-R1 (subtle) and Claude 3.7 Sonnet (non-subtle) as the untrusted models and GPT-4.1-mini as the trusted model. The CoT monitor assigns a high score to the subtle runtime-complexity side task (left), whereas the action-only monitor fails to detect the inefficient implementation. Conversely, for the obvious exfiltrate side task (right), contradictory reasoning misleads the CoT monitor, despite access to the final answer.
  • Figure 3: Catch rates of action-only and CoT monitoring across side tasks, with results averaged across untrusted models and attack policies for each side task. The dashed line separates the regions where CoT monitoring outperforms action-only (above), and vice versa (below).
  • Figure 4: Catch rate performance comparison at 1% FPR of all trusted models for action-only, CoT and hybrid monitors across side tasks. The subplots display results by monitor model as follows: top left shows GPT-4.1-mini, top right Gemini 2.5 Flash Lite, bottom left Claude 3.5 Haiku, and bottom right GPT-4.1-nano.
  • Figure 5: Catch rate performance of GPT-4.1-mini (left) and Gemini 2.5 Flash Lite (right) as monitors grouped by untrusted model and side task subtlety.
  • ...and 6 more figures
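
Several captions above report a catch rate at 1% FPR. One common way to compute such a metric is sketched below, under the assumption that the threshold is calibrated on monitor scores from benign runs and then applied to attack runs. The function names, the strict-inequality flagging rule, and the sample scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], target_fpr: float = 0.01) -> float:
    """Return a threshold such that at most target_fpr of benign scores exceed it."""
    if not benign_scores:
        raise ValueError("benign_scores must be non-empty")
    ordered = sorted(benign_scores, reverse=True)
    # Number of benign samples we may flag without exceeding the target rate.
    allowed = min(int(target_fpr * len(ordered)), len(ordered) - 1)
    # Flagging means score > threshold, so at most `allowed` benign samples are flagged.
    return ordered[allowed]


def catch_rate(attack_scores: Sequence[float], threshold: float) -> float:
    """Fraction of attack samples whose score strictly exceeds the threshold."""
    if not attack_scores:
        raise ValueError("attack_scores must be non-empty")
    return sum(score > threshold for score in attack_scores) / len(attack_scores)


# Illustrative data only: 100 benign scores and 4 attack scores.
benign = list(range(100))
attacks = [50, 98.5, 99, 100]
threshold = threshold_at_fpr(benign, target_fpr=0.01)
print(threshold, catch_rate(attacks, threshold))
```

Fixing the false positive rate first makes catch rates comparable across monitors whose raw scores are on different scales.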