Reasoning Models Sometimes Output Illegible Chains of Thought

Arun Jose

TL;DR

The paper investigates whether outcome-based reinforcement learning induces illegible chain-of-thought traces in 14 reasoning models, with a focus on monitoring efficacy. Using GPQA-Diamond questions and GPT-4o as an autograder, it finds that most RL-based reasoning models (except Claude) produce illegible CoTs while final answers stay readable, and legibility declines on harder tasks. Prefill experiments show illegible reasoning contributes to performance, but resampling reveals no consistent accuracy advantage for illegible CoTs, suggesting a nuanced role for such reasoning. The authors discuss hypotheses including vestigial reasoning and steganography, highlighting implications for designing robust CoT monitoring and safety controls in advanced LLMs.

Abstract

Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling, suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.

Paper Structure

This paper contains 16 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Examples of illegible and legible CoTs generated by DeepSeek-R1. Legibility is graded by having GPT-4o score the transcripts based on how well it understands the text. CoTs are scored from 1-9, with a higher score meaning less legible text. Both final answers shown are graded as correct.
  • Figure 2: Legibility scores of CoTs across 9 representative models. We test DeepSeek R1, R1-Zero, the non-reasoning model DeepSeek V3; R1-Distill-32B and R1-Distill-14B; QwQ; Qwen3-235B; and Claude Sonnet 4 and Opus 4. Legibility is scored from 1-9, with higher scores indicating greater illegibility. R1, R1-Zero, and QwQ have high tendencies to output illegible CoTs. These results suggest that illegible reasoning is likely a consequence of outcome-based RL, and that there may be ways to ameliorate the effect.
  • Figure 3: Model performance with and without illegible CoT text. We truncate QwQ's CoT at the point it starts becoming illegible (score $\geq$ 6), and use it as prefill to another instance of the model with reasoning tags closed, forcing it to give its final answer. Performance goes down noticeably, with answers marked as correct or partially correct dropping and incorrect answers increasing.
  • Figure 4: Legibility scores categorized by question hardness. All reasoning models output more illegible CoTs for harder questions. The scores are normalized for CoT length, to account for longer CoTs tending to be less legible — without normalization, the effect is stronger. Question hardness is determined by whether R1 at temperature 0 got them correct, partially correct, or incorrect, across multiple generations. These results suggest that illegible CoTs may be useful to the model, and that harder tasks make it "try harder".
  • Figure 5: Per-question correlations between CoT illegibility and answer correctness for QwQ. Each point represents the Pearson correlation coefficient for a single question across 100 samples. The average correlation is near zero (mean $r = 0.061$), indicating no consistent relationship between illegibility and correctness. Individual questions show varying correlations in both directions, but these do not aggregate to a population-level effect.
  • ...and 3 more figures
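
The per-question analysis described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `samples` layout (question id mapped to a list of `(illegibility_score, correct)` pairs), the function names, and the choice to skip questions where either variable is constant (Pearson's r is undefined there) are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined without variation
    return cov / sqrt(var_x * var_y)


def per_question_correlations(samples):
    """Map each question id to r between illegibility score and correctness.

    `samples` is a hypothetical structure: {question_id: [(score, correct), ...]},
    where score is the 1-9 illegibility grade and correct is 0 or 1.
    """
    correlations = {}
    for question_id, pairs in samples.items():
        scores = [float(score) for score, _ in pairs]
        correct = [float(flag) for _, flag in pairs]
        r = pearson(scores, correct)
        if r is not None:
            correlations[question_id] = r
    return correlations


def mean_correlation(samples):
    """Average the per-question coefficients (questions with no r are skipped)."""
    return mean(per_question_correlations(samples).values())
```

Averaging per-question coefficients, rather than pooling all samples into one correlation, keeps question difficulty from dominating the result: opposite-signed effects on individual questions can cancel, which is consistent with a mean near zero alongside varied individual values.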