Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Donald Ye; Max Loffgren; Om Kotadia; Linus Wong

TL;DR

This paper asks whether chain-of-thought explanations reflect genuine reasoning or post-hoc rationalization. It introduces Normalized Logit Difference Decay (NLDD), a logit-space, architecture-agnostic metric that quantifies step-level faithfulness by measuring how corrupting an individual reasoning step changes final-prediction confidence, normalized by per-model variability. Two complementary diagnostics, Representational Similarity Analysis (RSA) and Trajectory Alignment Score (TAS), assess internal representations and the geometry of reasoning trajectories under counterfactual perturbations. Across three benchmarks (Dyck-$n$, PrOntoQA, GSM8K) and three model families (DeepSeek, Llama, Gemma), the study finds a consistent Reasoning Horizon $k^*$ at around 70–85% of chain length, beyond which additional steps have little or negative causal influence on the answer, and a Mapping Gap in which internal task structure is encoded but not used for prediction. Accuracy alone is therefore insufficient to infer genuine reasoning; NLDD offers a practical diagnostic for when CoT matters and for deciding how to prune or rethink chain-based explanations.

Abstract

Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model's confidence in its answer drops, which indicates whether a step is truly important. By standardizing these measurements, NLDD enables rigorous comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon ($k^*$) at 70–85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
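The corrupt-and-measure procedure described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and every name in it is hypothetical. It assumes that confidence is measured as the margin between the correct-answer logit and the strongest competing logit, that standardization means dividing by a per-model scale `sigma`, and that the horizon is the last step whose corruption still lowers confidence by more than a chosen threshold. The paper's exact definitions may differ.

```python
import numpy as np


def logit_difference(logits, correct_idx):
    """Margin between the correct-answer logit and the strongest competitor.

    Assumption: this margin stands in for "confidence in the answer".
    """
    logits = np.asarray(logits, dtype=float)
    competitors = np.delete(logits, correct_idx)
    return float(logits[correct_idx] - competitors.max())


def nldd(clean_logits, corrupted_logits, correct_idx, sigma):
    """Drop in logit difference after corrupting one step, in units of sigma.

    Assumption: sigma is a per-model scale (for example, the standard
    deviation of clean logit differences) used to standardize the drop.
    Positive values mean the corrupted step mattered to the answer;
    values near zero or below mean it had little or negative effect.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    drop = logit_difference(clean_logits, correct_idx) - logit_difference(
        corrupted_logits, correct_idx
    )
    return drop / sigma


def reasoning_horizon(nldd_by_step, threshold=0.0):
    """Return (k_star, k_star / chain_length) for a list of per-step scores.

    Assumption: k_star is the 1-based index of the last step whose score
    exceeds the threshold. Returns (0, 0.0) if no step qualifies.
    """
    scores = np.asarray(nldd_by_step, dtype=float)
    influential = np.nonzero(scores > threshold)[0]
    if influential.size == 0:
        return 0, 0.0
    k_star = int(influential[-1]) + 1
    return k_star, k_star / scores.size
```

With these helpers, a five-step chain whose per-step scores first fall to or below the threshold at the last step has its horizon at step 4, that is, at 80% of chain length. The threshold and the choice of `sigma` are free parameters in this sketch and would need to be set from calibration data.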

Paper Structure (48 sections, 7 equations, 8 figures, 6 tables)

Introduction
Related Work
Mechanistic Framework and Experimental Design
Task Design and Dataset Construction
Benchmarks
Counterfactual Construction.
NLDD
Global Calibration.
Logit Difference.
Faithfulness Quantification.
RSA
Representational Dissimilarity Matrices (RDM)
Temporal Analysis.
Similarity Quantification.
TAS
...and 33 more sections

Figures (8)

Figure 1: NLDD reveals divergent faithfulness regimes. Both models maintain stable RSA, indicating consistent internal representations. Yet causal dependence differs: DeepSeek relies on its reasoning chain, while Gemma's accuracy improves when reasoning is corrupted.
Figure 2: The spectrum of ambiguity across reasoning tasks.
Figure 3: LLaMA-3.1-8B: NLDD, RSA, and TAS as a function of corruption step index $k$ across tasks.
Figure 4: LLaMA-3.1-8B: robustness diagnostics under counterfactual step corruption.
Figure 5: DeepSeek-Coder-6.7B: NLDD, RSA, and TAS as a function of corruption step index $k$ across tasks.
...and 3 more figures
