How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Xi Chen, Aske Plaat, Niki van Stein

TL;DR

This work investigates whether Chain-of-Thought prompts reflect true internal reasoning in large language models. It introduces a feature-level causal framework that combines sparse autoencoders with activation patching to extract and manipulate monosemantic internal features, enabling direct testing of CoT faithfulness on GSM8K problems. The results show that CoT prompts induce more interpretable, sparsified, and causally effective feature representations in the larger Pythia-2.8B model, while smaller models exhibit minimal or unstable benefits, suggesting a scale-dependent mechanism for CoT efficacy. Overall, the study demonstrates CoT as a structuring prompt that reshapes internal computations, with implications for mechanistic interpretability and faithful reasoning in high-capacity LLMs.

Abstract

Chain-of-thought (CoT) prompting improves the accuracy of large language models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model but has no reliable effect in the 70M model, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is present not only in the top-K patches but is widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
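
The feature-swapping step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the ReLU encoder and linear decoder form of the sparse autoencoder, the rule of ranking features by the absolute CoT/noCoT activation gap, the choice to carry over the reconstruction error, and all function names (`encode`, `decode`, `patch_features`, `top_k_by_difference`) are assumptions made for illustration.

```python
def encode(x, W_enc, b_enc):
    """Assumed ReLU SAE encoder: d-dim activation -> m feature activations.

    W_enc[j] is the d-dim encoder row for feature j.
    """
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + b)
            for row, b in zip(W_enc, b_enc)]


def decode(f, W_dec, b_dec):
    """Assumed linear SAE decoder: m feature activations -> d-dim vector.

    W_dec[j] is the d-dim direction written by feature j.
    """
    return [b + sum(fj * W_dec[j][i] for j, fj in enumerate(f))
            for i, b in enumerate(b_dec)]


def top_k_by_difference(x_nocot, x_cot, sae, k):
    """Hypothetical selection rule: the k features whose activation differs
    most between the CoT and noCoT runs."""
    W_enc, b_enc, _, _ = sae
    f_nocot = encode(x_nocot, W_enc, b_enc)
    f_cot = encode(x_cot, W_enc, b_enc)
    gap = [abs(c - n) for n, c in zip(f_nocot, f_cot)]
    return sorted(range(len(gap)), key=lambda j: gap[j], reverse=True)[:k]


def patch_features(x_nocot, x_cot, sae, idx):
    """Overwrite the features in idx with their CoT values and map the
    result back to activation space."""
    W_enc, b_enc, W_dec, b_dec = sae
    f_nocot = encode(x_nocot, W_enc, b_enc)
    f_cot = encode(x_cot, W_enc, b_enc)
    # Carry over the SAE reconstruction error so that only the selected
    # features change (an assumption; the source does not specify this).
    recon = decode(f_nocot, W_dec, b_dec)
    err = [x - r for x, r in zip(x_nocot, recon)]
    f_patched = list(f_nocot)
    for j in idx:
        f_patched[j] = f_cot[j]
    return [r + e for r, e in zip(decode(f_patched, W_dec, b_dec), err)]
```

In a real experiment the patched vector would be written back into the model's forward pass at the hooked layer, and the answer log-probability would be compared with the unpatched noCoT run.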

Paper Structure

This paper contains 19 sections, 5 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Workflow of the approach: after the sparse autoencoder (SAE) step, we perform activation patching, feature interpretation, and activation sparsity analysis. All three confirm that CoT improves the faithfulness of reasoning.
  • Figure 2: Comparison of feature explanation scores under CoT and NoCoT prompts. Left: Pythia-70M; Right: Pythia-2.8B. The 2.8B model shows higher explanation scores under CoT, indicating stronger causal features are learned in the larger model when CoT prompting is applied. Each plot is based on 50 features per condition.
  • Figure 3: Distribution of log-probability changes after patching the top 20 CoT features into NoCoT runs under dictionary ratio 4. Left: Pythia-70M; Right: Pythia-2.8B. While 2.8B shows a strong positive shift indicating consistent benefit from CoT features, 70M shows highly variable effects, including large performance drops, suggesting unstable or less effective feature transfer.
  • Figure 4: Distribution of log-probability changes after patching the top 20 CoT features into NoCoT runs under dictionary ratio 8. Left: Pythia-70M; Right: Pythia-2.8B. Compared to ratio 4, the distributions are similar: 2.8B continues to show consistent improvements, while 70M remains less robust, exhibiting high variance and frequent negative effects.
  • Figure 5: Top-$K$ and Random-$K$ patching performance under dictionary ratio 4. Left: Pythia-70M; Right: Pythia-2.8B. CoT$\rightarrow$NoCoT patching shows the effect of patching CoT features into NoCoT, while NoCoT$\rightarrow$CoT patching shows the reverse. In 2.8B, patching CoT features yields consistent performance gains, highlighting their causal importance. In contrast, for 70M, patching CoT features leads to a substantial and monotonic performance decline, suggesting that CoT-induced features are ineffective or even harmful in the smaller model ($p < 0.001$).
  • ...and 11 more figures
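
The Top-$K$ versus Random-$K$ comparison in Figure 5 can be sketched as a small "patch curve" routine. This is an illustrative outline only: `effect_of` is a hypothetical callable standing in for the real measurement (the change in answer log-probability after patching a given set of features), and the number of random draws, the seed, and the toy additive gains are assumptions, not values from the paper.

```python
import random


def patch_curve(effect_of, ranked_features, ks, n_random=20, seed=0):
    """For each k, compare patching the top-k ranked features with the
    average effect of patching k features chosen uniformly at random.

    effect_of: hypothetical callable mapping a list of feature indices to
               a scalar effect (e.g. a log-probability change).
    ranked_features: feature indices sorted from most to least important.
    Returns a list of (k, top_k_effect, mean_random_k_effect) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for k in ks:
        top_effect = effect_of(ranked_features[:k])
        random_effects = [effect_of(rng.sample(ranked_features, k))
                          for _ in range(n_random)]
        rows.append((k, top_effect, sum(random_effects) / n_random))
    return rows


# Toy stand-in for the real measurement: each feature adds a fixed gain.
toy_gain = [1.0 / (j + 1) for j in range(10)]
curve = patch_curve(lambda idx: sum(toy_gain[j] for j in idx),
                    list(range(10)), ks=[1, 5, 10])
```

If the random-$K$ curve rises nearly as fast as the top-$K$ curve, the useful information is spread across many features rather than concentrated in a few, which is the kind of comparison the baseline is meant to support.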