Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song
TL;DR
This work examines the faithfulness of chain-of-thought (CoT) reasoning in large language models by using a causal framework to distinguish true-thinking steps from decorative ones. It introduces the True Thinking Score (TTS), built from two context-based Average Treatment Effects, and identifies a TrueThinking direction in latent space along which the model can be steered to force or suppress internal reasoning over specific steps. Empirical results show that a CoT contains only a small subset of genuinely causal steps among many decorative ones, and that self-verification steps can themselves be decorative. The latent-steering experiments provide a principled, indirect validation of faithfulness metrics and reveal a mechanism for manipulating internal reasoning, with implications for the efficiency and safety monitoring of CoT approaches.
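As a rough illustration of how a score could be built from two context-based Average Treatment Effects, the sketch below compares the probability of the model's final answer with a step intact versus perturbed, once under an intact preceding context and once under a perturbed one. The function names, the averaging of absolute effects, and the input probabilities are all assumptions made for illustration; the summary above does not state the exact formula.

```python
from typing import Sequence


def average_treatment_effect(p_step_intact: Sequence[float],
                             p_step_perturbed: Sequence[float]) -> float:
    """Mean difference in P(final answer) between keeping and perturbing a step.

    Each sequence holds one probability per sampled run (hypothetical inputs).
    """
    if len(p_step_intact) != len(p_step_perturbed) or not p_step_intact:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(p_step_intact, p_step_perturbed)]
    return sum(diffs) / len(diffs)


def true_thinking_score(ate_intact_context: float,
                        ate_perturbed_context: float) -> float:
    """Assumed combination rule: average the magnitudes of the two effects.

    Because each effect is a difference of probabilities, the result lies in [0, 1].
    """
    return 0.5 * (abs(ate_intact_context) + abs(ate_perturbed_context))


def fraction_at_or_above(scores: Sequence[float], threshold: float = 0.7) -> float:
    """Share of steps whose score reaches the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy numbers only: a step whose removal changes the answer probability a lot,
# and a step whose removal changes nothing.
causal = true_thinking_score(
    average_treatment_effect([0.9, 0.9], [0.1, 0.1]),
    average_treatment_effect([0.8, 0.8], [0.2, 0.2]),
)
decorative = true_thinking_score(
    average_treatment_effect([0.9, 0.9], [0.9, 0.9]),
    average_treatment_effect([0.5, 0.5], [0.5, 0.5]),
)
```

Under this assumed rule, a step that leaves the answer probability unchanged in both contexts scores zero, which matches the intuition of a decorative step.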
Abstract
Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. The reasoning steps in a CoT are often assumed to be a faithful reflection of the model's internal thinking process and are used to monitor unsafe intentions. However, we find that many reasoning steps do not truly contribute to the LLM's prediction. We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS). We reveal that LLMs often interleave true-thinking steps (which are genuinely used to produce the final output) with decorative-thinking steps (which only give the appearance of reasoning and have minimal causal impact). Notably, only a small subset of the reasoning steps have a high TTS and causally drive the model's prediction: e.g., on the AIME dataset, only 2.3% of the reasoning steps in a CoT on average have a TTS >= 0.7 (range: 0-1) under the Qwen-2.5 model. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along or against this direction, we can force the model to perform or disregard certain CoT steps when computing the final result. Finally, we highlight that self-verification steps in CoT (i.e., aha moments) can also be decorative, in which case LLMs do not truly verify their solution. Steering along the TrueThinking direction can force internal reasoning over these steps, resulting in a change in the final result. Overall, our work reveals that LLMs often verbalize reasoning steps without actually performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
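Steering along or against a latent direction can be pictured as adding a scaled unit vector to a hidden state. The sketch below is a toy illustration in plain Python, not the authors' procedure: the difference-of-means estimate of the direction, the helper names, and the two-dimensional "hidden states" are all assumptions, since the abstract does not say how the TrueThinking direction is extracted or applied.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit_direction(true_states: Sequence[Sequence[float]],
                   decorative_states: Sequence[Sequence[float]]) -> Vector:
    """Assumed estimator: normalized difference between the two class means."""
    mu_true = mean_vector(true_states)
    mu_decorative = mean_vector(decorative_states)
    diff = [a - b for a, b in zip(mu_true, mu_decorative)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; direction is undefined")
    return [x / norm for x in diff]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along (alpha > 0) or against (alpha < 0) the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar component of the hidden state along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy hidden states (hypothetical): steps labeled true-thinking vs decorative.
direction = unit_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
pushed_toward = steer([1.0, 1.0], direction, alpha=2.0)
pushed_away = steer([1.0, 1.0], direction, alpha=-2.0)
```

In a real model the hidden states would come from a chosen layer at the token positions of a reasoning step, and the steering strength would need tuning; both choices are left open here.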
