Investigating CoT Monitorability in Large Reasoning Models
Shu Yang, Junchao Wu, Xilin Gong, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
TL;DR
This work systematically investigates the monitorability of chain-of-thought (CoT) in Large Reasoning Models by separating two core challenges—verbalization fidelity and monitor reliability—across math, science, and ethics tasks. It introduces formal metrics (CIR, AKR, VR, Robustness, Scheming, MFR, EEMR, OSM) and adversarial cue prompts to quantify how models verbalize decision factors and how monitors detect misbehavior. The study finds that CoT monitors exhibit systematic over-sensitivity and that verbalization does not trivially predict monitor effectiveness; longer, more reflective reasoning can improve robustness but also reveal new vulnerabilities under CoT interventions. To address these issues, the authors propose MoME, a framework where monitors output structured, evidence-backed JSON judgments, trained with Direct Preference Optimization (DPO) to balance false positives and false negatives. Empirically, MoME-based monitoring with DPO achieves superior Monitor Effectiveness Scores across multiple cue types, suggesting a promising direction for safer deployment of LRMs with transparent, evidence-based supervision.
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models' long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by two fundamental challenges we mentioned before, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models' misbehavior through their CoT and provide structured judgments along with supporting evidence.
