Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Richard J. Young

Abstract

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, but its usefulness depends on faithfulness: whether models accurately verbalize the factors that actually influence their outputs. Prior evaluations have examined this property in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond. Six categories of reasoning hints are injected (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information), and the study measures how often models acknowledge hint influence in their CoT when a hint successfully alters the answer. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale), with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count. Keyword-based analysis reveals a large gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.
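
The faithfulness measure described in the abstract can be illustrated with a minimal Python sketch. The record layout (`baseline_answer`, `hinted_answer`, `hint_target`, `acknowledged`) and the function names are hypothetical, introduced only for illustration. The sketch assumes a run counts as influenced when the model moves from a different baseline answer to the hinted answer, and it takes the acknowledgment label as given rather than reproducing the paper's classifiers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Run:
    """One hinted run paired with its baseline (hypothetical layout)."""
    baseline_answer: str   # answer given without the hint
    hinted_answer: str     # answer given with the hint injected
    hint_target: str       # answer the hint points to
    acknowledged: bool     # whether the CoT was judged to mention the hint


def is_influenced(run: Run) -> bool:
    # Assumption: "influenced" means the answer changed to match the hint.
    return (run.baseline_answer != run.hint_target
            and run.hinted_answer == run.hint_target)


def faithfulness_rate(runs: Iterable[Run]) -> Optional[float]:
    """Share of influenced runs whose CoT acknowledges the hint."""
    influenced = [r for r in runs if is_influenced(r)]
    if not influenced:
        return None  # undefined when no run was influenced
    return sum(r.acknowledged for r in influenced) / len(influenced)


# Toy data, not taken from the paper.
toy_runs = [
    Run("A", "B", "B", True),    # influenced, acknowledged
    Run("A", "B", "B", False),   # influenced, not acknowledged
    Run("A", "A", "B", False),   # hint ignored: excluded
    Run("B", "B", "B", True),    # baseline already matched: excluded
]
print(faithfulness_rate(toy_runs))
```

Conditioning on influenced runs matters: a model that ignores a hint has nothing to acknowledge, so including those runs would dilute the rate.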

Paper Structure

This paper contains 44 sections, 1 equation, 7 figures, and 8 tables.

Figures (7)

  • Figure 1: Experimental pipeline. 498 questions are paired with 12 models for baseline runs, then augmented with 6 hint types for hinted runs (41,832 total inference calls). The 10,276 influenced cases (where the model changed its answer to match the hint) are classified by two independent systems.
  • Figure 2: Hint influence rate (%) by model and hint type. Darker cells indicate higher susceptibility to the hint. Qwen3.5-27B shows the highest average influence rate (44.6%), while MiniMax-M2.5 shows the lowest (20.2%).
  • Figure 3: Faithfulness rate (%) by model and hint type, as assessed by the Sonnet judge. Values represent the proportion of hint-influenced responses where the CoT explicitly acknowledges the hint. Faithfulness ranges from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale).
  • Figure 4: Sonnet-judged faithfulness rates by hint type with sample sizes. Consistency hints show the lowest faithfulness (35.5%), while unethical-information hints show the highest (79.4%). The dashed line indicates the overall average (69.7%).
  • Figure 5: Faithfulness rate vs. active parameter count (log scale). Each point is one model. The dashed trend line ($R^2 = 0.07$) indicates no strong linear relationship between scale and faithfulness.
  • ...and 2 more figures
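
The run counts quoted in the Figure 1 caption can be checked with a few lines of arithmetic. This is a sketch that assumes exactly one inference run per (question, model, condition) combination, which is consistent with the reported total but is not stated in the caption; the variable names are illustrative.

```python
# Counts taken from the Figure 1 caption.
NUM_QUESTIONS = 498
NUM_MODELS = 12
NUM_HINT_TYPES = 6
INFLUENCED_CASES = 10_276

# Assumption: one run per (question, model) for the baseline condition,
# and one run per (question, model, hint type) for the hinted condition.
baseline_runs = NUM_QUESTIONS * NUM_MODELS
hinted_runs = baseline_runs * NUM_HINT_TYPES
total_runs = baseline_runs + hinted_runs

# Share of hinted runs in which the answer changed to match the hint.
influence_share = INFLUENCED_CASES / hinted_runs

print(baseline_runs, hinted_runs, total_runs)
print(f"{influence_share:.1%}")
```

Under that assumption the total comes to 41,832, matching the caption, and roughly 29% of hinted runs fall into the influenced set that the two classifiers examine.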