Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
Donald Ye, Max Loffgren, Om Kotadia, Linus Wong
TL;DR
This paper asks whether chain-of-thought explanations reflect genuine reasoning or post-hoc rationalization. It introduces Normalized Logit Difference Decay (NLDD), a logit-space, architecture-agnostic metric that quantifies step-level faithfulness by measuring how corrupting individual reasoning steps changes final-prediction confidence, normalized by per-model variability. Two complementary diagnostics, Representational Similarity Analysis (RSA) and Trajectory Alignment Score (TAS), assess internal representations and the geometry of reasoning trajectories under counterfactual perturbations. Across three benchmarks (Dyck-$n$, PrOntoQA, GSM8K) and three model families (DeepSeek, Llama, Gemma), the study finds a consistent Reasoning Horizon $k^*$ at around 70–85% of chain length, beyond which additional steps have little or negative causal influence, and it uncovers a Mapping Gap in which internal task structure is encoded yet not used for prediction. The findings show that accuracy alone is insufficient to infer genuine reasoning, and that NLDD provides a practical diagnostic for evaluating when CoT matters and for deciding how to prune or rethink chain-based explanations.
Abstract
Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model's confidence in its answer drops, which indicates whether a step is truly important. Because these measurements are standardized, NLDD supports comparison across models with different architectures. Testing three model families on syntactic, logical, and arithmetic tasks, we find a consistent Reasoning Horizon ($k^*$) at 70–85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
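To make the corrupt-and-measure idea concrete, the following is a minimal sketch of a step-level score in the spirit of NLDD. It is not the paper's exact definition. It assumes that "confidence" is the logit of the correct answer minus the largest competing logit, and that "standardizing" means dividing by the standard deviation of a model's clean logit differences. The function names (`logit_difference`, `nldd_per_step`) and all numbers are hypothetical, and in practice the logits would come from running the model on clean and corrupted chains.

```python
from statistics import pstdev
from typing import List, Sequence


def logit_difference(logits: Sequence[float], correct: int) -> float:
    """Correct-answer logit minus the largest competing logit (assumed confidence measure)."""
    best_other = max(v for i, v in enumerate(logits) if i != correct)
    return logits[correct] - best_other


def nldd_per_step(clean_ld: float, corrupted_lds: Sequence[float], scale: float) -> List[float]:
    """Normalized drop in logit difference when each reasoning step is corrupted in turn.

    corrupted_lds[k] is the logit difference after corrupting step k only.
    scale is a per-model normalizer (assumed here: std of clean logit differences).
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return [(clean_ld - ld) / scale for ld in corrupted_lds]


# Hypothetical numbers, for illustration only.
clean_ld = logit_difference([4.0, 1.0, 0.5], correct=0)
scale = pstdev([1.0, 5.0])               # spread of one model's clean logit differences
corrupted_lds = [0.5, 1.0, 2.5, 3.5]     # one value per corrupted step
scores = nldd_per_step(clean_ld, corrupted_lds, scale)
print(scores)
```

Under these assumptions, a large positive score means that corrupting the step sharply reduces confidence, while a score near zero or below zero for late steps corresponds to the "little or negative effect" that the abstract reports beyond the Reasoning Horizon.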
