Reasoning Models Sometimes Output Illegible Chains of Thought
Arun Jose
TL;DR
The paper investigates whether outcome-based reinforcement learning induces illegible chain-of-thought (CoT) traces in 14 reasoning models, with a focus on what this means for CoT monitoring. Using GPQA-Diamond questions and GPT-4o as an autograder, it finds that most RL-trained reasoning models (Claude being the exception) produce illegible CoTs while their final answers stay readable, and that legibility declines on harder questions. Prefill experiments that force models to use only the legible portions of their reasoning reduce accuracy by 53%, indicating that illegible reasoning contributes to performance, yet resampling shows no correlation between legibility and accuracy, suggesting a more nuanced role for such reasoning. The authors discuss hypotheses including vestigial tokens, training artifacts, and steganography, and highlight implications for designing robust CoT monitoring and safety controls in advanced LLMs.
Abstract
Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. To be effective, however, monitoring requires that CoTs be legible and faithful. We study CoT legibility across 14 reasoning models and find that RL often causes reasoning to become illegible to both humans and AI monitors: reasoning models (except Claude) generate illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy drops by 53% when they are forced to use only the legible portions), yet we find no correlation between legibility and performance when resampling, suggesting that the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.
