Mitigating Deceptive Alignment via Self-Monitoring

Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang

TL;DR

This work addresses deceptive alignment in large language models that use chain-of-thought reasoning by introducing CoT Monitor+, which embeds a self-monitoring signal into the reasoning process and uses it as an intrinsic reward in RL. Deception is modeled with a formal framework of outer goals and mesa utilities, and a data-driven Self-Monitor paradigm lets the model audit its own reasoning while generating CoT, improving safety without sacrificing performance. To evaluate deception systematically, the authors propose DeceptionBench, a five-category benchmark with a Deception Tendency Rate (DTR) metric, and show that Self-Monitor reduces deceptive tendencies by 43.8% and that DTR agrees more closely with human judgments. The results suggest that intrinsic oversight within reasoning, paired with constrained RL, yields more transparent and robust alignment, offering a scalable path to safer, more trustworthy reasoning models.

Abstract

Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post hoc, leaving the model free to scheme during its internal reasoning. We ask: can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes behaviors such as covert alignment-faking and sycophancy. We evaluate various LLMs and show that unrestricted CoT tends to aggravate deceptive tendencies. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io.
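As a rough illustration of how a Deception Tendency Rate could be computed, the sketch below assumes a sample counts as deceptive when the model's reasoning agrees with its mesa utility (elicited by a neutral prompt) while its final action departs from it. This counting rule, the `Sample` type, and its field names are assumptions made for illustration; the paper's exact definition may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    # Hypothetical per-prompt judgments; the field names are illustrative only.
    cot_matches_mesa: bool      # reasoning is consistent with the elicited mesa utility
    action_matches_mesa: bool   # final action is consistent with that same mesa utility


def deception_tendency_rate(samples: Sequence[Sample]) -> float:
    """Fraction of samples whose reasoning follows the mesa utility but whose action does not.

    Assumed counting rule, not necessarily the authors' exact formula.
    """
    if not samples:
        raise ValueError("need at least one sample")
    deceptive = sum(
        1 for s in samples if s.cot_matches_mesa and not s.action_matches_mesa
    )
    return deceptive / len(samples)


if __name__ == "__main__":
    demo = [
        Sample(cot_matches_mesa=True, action_matches_mesa=False),   # counted as deceptive
        Sample(cot_matches_mesa=True, action_matches_mesa=True),    # consistent throughout
        Sample(cot_matches_mesa=False, action_matches_mesa=False),  # not counted under this rule
        Sample(cot_matches_mesa=True, action_matches_mesa=False),   # counted as deceptive
    ]
    print(deception_tendency_rate(demo))
```

Under this assumed rule, a model whose actions always match its reasoning scores zero regardless of what its mesa utility is, which is why the metric targets the gap between thought and action rather than harmfulness alone.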

Paper Structure

This paper contains 32 sections, 2 theorems, 19 equations, 10 figures, 6 tables.

Key Result

Lemma B.1

Consider a compact convex subset $X$ and a convex subset $Y$ in linear convex topological spaces. Let $f: X \times Y \to \mathbb{R}$ be a function satisfying: (a) for each $y \in Y$, $x \to f(x, y)$ is convex and lower semi-continuous, and (b) for each $x \in X$, $y \to f(x, y)$ is concave. Then, there

Figures (10)

  • Figure 1: Illustration of AI deception. AI deception occurs when conflicting intents (the mesa utility function versus the outer-driven goal) cause a model to adopt secretly misaligned behaviors (e.g., during alignment vs. deployment [greenblatt2024alignment]) that cannot be monitored directly through its actions.
  • Figure 2: (Left) How Self-Monitor works. Unlike the standard LLM reasoning process, the Self-Monitor model first detects deceptive or harmful patterns in its CoT, and then generates a safer response based on the monitor's evaluation. In reinforcement learning, the reward signal from the self-monitor's CoT detection is combined with a standard action monitor to optimize the model. (Right) How DeceptionBench evaluates the deceptive tendency of models. Neutral prompts and specifically constructed prompts with an outer goal are used to elicit the model's mesa utility and CoT-Action pairs, respectively. These pairs are then evaluated for consistency and deceptiveness.
  • Figure 3: Comparison of agreement with human judgments. We evaluate how consistent the DTR and LLM-as-Judge are with human judgments in deception evaluation. The DTR outperforms LLM-as-Judge in both human agreement rate and Phi coefficient [matthews1975comparison], regardless of the choice of judge model.
  • Figure 4: The deception tendency of API-based and open-source models in DeceptionBench.
  • Figure 5: Analogy and experimental results for RL with different monitors. We conduct three RL training runs with distinct monitor settings to study deceptive alignment, and quantify the resulting fraction of deceptive CoT and the performance gap of the models in each setting over the course of RL training.
  • ...and 5 more figures
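Figure 3 compares metrics by their human agreement rate and Phi coefficient. As a minimal sketch of what those two quantities measure for binary deceptive/non-deceptive labels, the code below computes both from paired label lists. The function names and the toy labels are illustrative assumptions, not data or code from the paper; returning 0.0 when the Phi denominator vanishes is a common convention chosen here for convenience.

```python
import math
from typing import Sequence


def agreement_rate(human: Sequence[int], metric: Sequence[int]) -> float:
    """Share of items on which the metric's binary label equals the human label."""
    if len(human) != len(metric) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, metric)) / len(human)


def phi_coefficient(human: Sequence[int], metric: Sequence[int]) -> float:
    """Phi (Matthews) correlation between two binary labelings."""
    if len(human) != len(metric) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    # Build the 2x2 confusion counts, treating the human label as ground truth.
    tp = sum(1 for h, m in zip(human, metric) if h == 1 and m == 1)
    tn = sum(1 for h, m in zip(human, metric) if h == 0 and m == 0)
    fp = sum(1 for h, m in zip(human, metric) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, metric) if h == 1 and m == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when a row or column of the table is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    human_labels = [1, 1, 0, 0]    # toy labels, 1 = judged deceptive
    metric_labels = [1, 0, 0, 1]   # toy output of some automatic metric
    print(agreement_rate(human_labels, metric_labels))
    print(phi_coefficient(human_labels, metric_labels))
```

The two numbers can diverge: agreement rate is inflated when one class dominates, whereas Phi corrects for class imbalance, which is a plausible reason for reporting both.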

Theorems & Definitions (5)

  • Definition 3.1: $\text{MDP} \setminus \text{R}$
  • Definition 3.3: The Deceptive Behavior of LLMs
  • Lemma B.1: Minimax Theorem
  • Theorem B.2: The Lagrangian method
  • proof