How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
Xi Chen, Aske Plaat, Niki van Stein
TL;DR
This work investigates whether Chain-of-Thought (CoT) prompts reflect the true internal reasoning of large language models. It introduces a feature-level causal framework that combines sparse autoencoders with activation patching to extract and manipulate monosemantic internal features, enabling direct tests of CoT faithfulness on GSM8K problems. The results show that CoT prompts induce more interpretable, sparser, and more causally effective feature representations in the larger Pythia-2.8B model, while the smaller Pythia-70M model shows minimal or unstable benefits, suggesting that the mechanism behind CoT's efficacy depends on scale. Overall, the study characterizes CoT as a structuring prompt that reshapes internal computation, with implications for mechanistic interpretability and faithful reasoning in high-capacity LLMs.
Abstract
Chain-of-thought (CoT) prompting boosts the accuracy of large language models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model but has no reliable effect in the 70M model, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not confined to the top-K patches but is widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
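
To make the patching idea concrete, the following is a minimal toy sketch of feature-level patching through a sparse autoencoder, written in plain Python. It is not the authors' implementation. All function names, the identity-weight autoencoder, the rule of ranking features by activation difference, and the distance-based score (a stand-in for the answer log-probability used in the paper) are assumptions made for illustration only.

```python
# Toy sketch of SAE feature patching (illustrative; all names are hypothetical).

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_encode(x, w_enc, b_enc):
    """Encode an activation into non-negative feature activations (ReLU)."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(features, w_dec):
    """Reconstruct an activation; w_dec holds one decoder direction per feature."""
    out = [0.0] * len(w_dec[0])
    for f, direction in zip(features, w_dec):
        for j, d in enumerate(direction):
            out[j] += f * d
    return out

def top_k_features(f_cot, f_nocot, k):
    """Assumed selection rule: rank features by |CoT - noCoT| activation gap."""
    gaps = [abs(a - b) for a, b in zip(f_cot, f_nocot)]
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return order[:k]

def patch_features(f_nocot, f_cot, indices):
    """Copy the selected CoT feature activations into the noCoT feature vector."""
    patched = list(f_nocot)
    for i in indices:
        patched[i] = f_cot[i]
    return patched

def patch_curve(f_nocot, f_cot, w_dec, score_fn, ks):
    """Score the patched reconstruction for each number of patched features k."""
    scores = []
    for k in ks:
        idx = top_k_features(f_cot, f_nocot, k)
        patched = patch_features(f_nocot, f_cot, idx)
        scores.append(score_fn(sae_decode(patched, w_dec)))
    return scores

# Toy data: a 3-feature SAE with identity weights (assumption, for readability).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO_BIAS = [0.0, 0.0, 0.0]
act_nocot = [1.0, 0.0, 2.0]   # hypothetical activation under a plain prompt
act_cot = [1.0, 3.0, 0.5]     # hypothetical activation under a CoT prompt

f_nocot = sae_encode(act_nocot, IDENTITY, ZERO_BIAS)
f_cot = sae_encode(act_cot, IDENTITY, ZERO_BIAS)

def toy_score(activation):
    """Stand-in for answer log-probability: negative squared distance to act_cot."""
    return -sum((a - c) ** 2 for a, c in zip(activation, act_cot))

top1 = top_k_features(f_cot, f_nocot, 1)
patched_top1 = sae_decode(patch_features(f_nocot, f_cot, top1), IDENTITY)
curve = patch_curve(f_nocot, f_cot, IDENTITY, toy_score, ks=[0, 1, 2])
print(top1, patched_top1, curve)
```

In a real experiment the patched reconstruction would be written back into the model's forward pass at the chosen layer, and the score would be the log-probability the model assigns to the correct answer. A random-feature baseline can be sketched the same way by replacing `top_k_features` with a random choice of k indices.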
