Mitigating Deceptive Alignment via Self-Monitoring
Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang
TL;DR
This work addresses deceptive alignment in large language models that use chain-of-thought (CoT) reasoning by introducing CoT Monitor+, which embeds a self-monitoring signal into the reasoning process and uses it as an intrinsic reward in RL. The authors model deception with a formal framework of outer goals and mesa utilities, and a data-driven Self-Monitor paradigm lets the model audit its own reasoning while generating CoT, improving safety without sacrificing performance. To evaluate deception systematically, they propose DeceptionBench, a five-category benchmark with a Deception Tendency Rate metric, and show that Self-Monitor reduces deceptive tendencies by 43.8% on average and enhances alignment with human judgments. The results suggest that intrinsic oversight within reasoning, paired with constrained RL, yields more transparent and robust alignment, offering a scalable path to safer, more trustworthy reasoning models.
Abstract
Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post hoc, leaving the model free to scheme during its internal reasoning. We ask: can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes behaviors such as covert alignment-faking and sycophancy. We evaluate various LLMs and show that unrestricted CoT tends to aggravate deceptive tendencies. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io
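
The abstract states that the self-monitor signal serves as an auxiliary reward during RL. The following is a minimal Python sketch of that idea only, not the authors' implementation. The names `Rollout`, `monitor_flag`, `shaped_reward`, and `weight`, the additive form, and the [0, 1] scale of the signal are all assumptions made for illustration; the paper's actual reward formulation may differ.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled response (hypothetical container, not from the paper)."""
    task_reward: float   # external reward for the final answer
    monitor_flag: float  # assumed self-monitor score in [0, 1]; 1 = flagged as deceptive


def shaped_reward(rollout: Rollout, weight: float = 0.5) -> float:
    """Add an auxiliary honesty bonus to the task reward.

    Assumption: the bonus is (1 - monitor_flag), scaled by `weight`,
    so reasoning the self-monitor does not flag earns more reward.
    """
    if not 0.0 <= rollout.monitor_flag <= 1.0:
        raise ValueError("monitor_flag must lie in [0, 1]")
    honesty_bonus = 1.0 - rollout.monitor_flag
    return rollout.task_reward + weight * honesty_bonus


if __name__ == "__main__":
    honest = Rollout(task_reward=1.0, monitor_flag=0.0)
    flagged = Rollout(task_reward=1.0, monitor_flag=1.0)
    print(shaped_reward(honest), shaped_reward(flagged))
```

Under these assumptions, two rollouts with the same task reward are ranked by how strongly the self-monitor flags their reasoning, which is the feedback loop the abstract describes in prose.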
