CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong
TL;DR
This work evaluates safety monitoring paradigms for autonomous AI by comparing Chain-of-Thought (CoT) monitoring, action-only monitoring, and a hybrid approach in a red-team coding setting. Using 1,140 BigCodeBench problems paired with side tasks, it shows that CoT monitoring improves detection of subtle sabotage but can be misled by plausible-sounding reasoning, while action-only monitoring performs better on overt sabotage. A hybrid protocol, which scores the CoT and the actions independently and combines the two scores with a weighted average, consistently outperforms both single-monitor strategies across models and tasks, with the largest gains in subtle deception scenarios. The study discusses limitations (e.g., single-step tasks, evaluation awareness) and highlights practical implications for deploying robust monitoring in frontier AI systems.
Abstract
As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage. However, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice as high as those of action-only monitoring for subtle deception scenarios.
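As a rough illustration of the hybrid protocol described above, the combination step can be sketched as a weighted average of two independently produced suspicion scores. This is a minimal sketch, not the authors' implementation: the function names, the 0 to 100 score range, the default weight of 0.5, and the flagging threshold are all assumptions made for the example.

```python
def hybrid_suspicion(cot_score: float, action_score: float,
                     cot_weight: float = 0.5) -> float:
    """Combine two independent monitor scores with a weighted average.

    cot_score:    suspicion score from a monitor that reads the reasoning trace
    action_score: suspicion score from a monitor that reads only the final output
    cot_weight:   weight on the CoT score (hypothetical default, not from the paper)
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def is_flagged(score: float, threshold: float) -> bool:
    """Flag a sample for review when its combined score reaches the threshold.

    The threshold is a hypothetical parameter; in practice it would be
    calibrated, for example against a target false positive rate.
    """
    return score >= threshold


# Example: the CoT monitor is suspicious (80) but the action monitor is not (20).
combined = hybrid_suspicion(cot_score=80.0, action_score=20.0, cot_weight=0.5)
print(combined, is_flagged(combined, threshold=40.0))
```

Because the two scores are produced independently, a misleading rationalization that lowers the CoT score does not affect the action score, and vice versa; the weight controls how much each signal contributes.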
