
Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning

Zidi Xiong, Shan Chen, Himabindu Lakkaraju

TL;DR

This work systematically characterizes how the monitorability of reasoning traces emerges during RLVR training of large reasoning models. It shows that monitorability gains are not universal: they depend strongly on the training data distribution, especially on instruction-following data, and they can transfer across domains, with the strongest correlations among Science, Code, and General tasks. Mechanistically, the gains arise more from distribution sharpening (entropy reduction) and prompt-focused attention than from stronger causal reliance on the reasoning trace, and the largest improvements occur early in training. The study validates monitorability with Draft-to-Answer Faithfulness and shows that training length and task difficulty modulate the effect, offering guidance for safer, auditable reasoning in LRMs and indicating when monitorability is likely to emerge or fade.

Abstract

As Large Reasoning Models (LRMs) are increasingly deployed, auditing their chain-of-thought (CoT) traces for safety becomes critical. Recent work has reported that monitorability--the degree to which CoT faithfully and informatively reflects internal computation--can appear as a "free gift" during the early stages of Reinforcement Learning with Verifiable Rewards (RLVR). We make this observation concrete through a systematic evaluation across model families and training domains. Our results show that this effect is not universal: monitorability improvements are strongly data-dependent. In particular, we demonstrate the critical role of data diversity and instruction-following data during RLVR training. We further show that monitorability is orthogonal to capability--improvements in reasoning performance do not imply increased transparency. Through mechanistic analysis, we attribute monitorability gains primarily to response distribution sharpening (entropy reduction) and increased attention to the prompt, rather than stronger causal reliance on reasoning traces. We also reveal how monitorability dynamics vary with controlled training and evaluation difficulty. Together, these findings provide a holistic view of how monitorability emerges under RLVR, clarifying when gains are likely to occur and when they are not.
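
To make "distribution sharpening (entropy reduction)" concrete, the sketch below computes the Shannon entropy of the empirical distribution over a set of sampled final answers. This is an illustrative assumption, not the paper's measurement procedure: the function name `empirical_entropy`, the choice of answer-level (rather than token-level) entropy, and the sample lists are all hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answers.

    `answers` is a non-empty list of hashable items, e.g. final answers
    sampled repeatedly for the same prompt (hypothetical setup).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples: a diffuse response distribution vs. a sharpened one.
diffuse = ["A", "B", "C", "D"]
sharp = ["A", "A", "A", "B"]

print(empirical_entropy(diffuse))  # entropy of a uniform 4-way distribution
print(empirical_entropy(sharp))    # lower: mass concentrates on one answer
```

A sharper distribution places more probability mass on fewer responses, so its entropy is lower; the abstract attributes monitorability gains partly to this kind of reduction.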

Paper Structure (67 sections, 10 equations, 12 figures, 7 tables)


Figures (12)

  • Figure 1: Trajectory of monitorability ($\text{g-mean}^2$) across training steps for two models. Top row: DeepSeek-R1-Distill-Qwen-1.5B. Bottom row: Qwen3-4B. Each column corresponds to a distinct test distribution. "w/o IF" here represents "All without IF". We filtered out cases where the intervention caused meaningful behavior changes in fewer than 5% of instances (as defined in Appendix \ref{app:metric_g2mean}).
  • Figure 2: Pearson correlation of monitorability ($\text{g-mean}^2$) across different test domains.
  • Figure 3: Top: Overall average capability across all capability benchmarks during training. Bottom: Pearson correlations between task performance and monitorability.
  • Figure 4: Instruction-Following capability vs. monitorability. Top: DeepSeek-R1-Distill-Qwen-1.5B, Bottom: Qwen3-4B. Model checkpoints are grouped by training data distribution. For models not trained on Instruction-Following data, monitorability is less correlated with Instruction-Following capability.
  • Figure 5: Response entropy vs. monitorability.
  • ...and 7 more figures
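
Figures 2 and 3 report Pearson correlations. As a minimal sketch of that statistic, the code below computes the Pearson coefficient between two series of monitorability scores measured at the same checkpoints. The function name `pearson` and the score values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-checkpoint monitorability scores on two test domains.
science_scores = [0.10, 0.20, 0.30, 0.40]
code_scores = [0.15, 0.25, 0.35, 0.45]

print(pearson(science_scores, code_scores))
```

A coefficient near 1 means the two domains' scores rise and fall together across checkpoints; a coefficient near 0 means the trajectories are linearly unrelated.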