Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song

TL;DR

This work interrogates the faithfulness of chain-of-thought (CoT) reasoning in large language models by distinguishing true-thinking steps from decorative ones using a causal framework. It introduces the True-Thinking Score (TTS), built from two context-based Average Treatment Effects, and identifies a TrueThinking direction in latent space that can be steered to force or suppress internal reasoning about specific steps. Empirical results show that a CoT contains a small subset of genuinely causal steps amid many decorative ones, and that self-verification steps can themselves be decorative. The latent-steering experiments provide a principled, indirect validation of faithfulness metrics and reveal a mechanism for manipulating internal reasoning, with implications for the efficiency and safety monitoring of CoT approaches.

Abstract

Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. The reasoning steps in CoT are often assumed to be a faithful reflection of the model's internal thinking process and are used to monitor unsafe intentions. However, we find that many reasoning steps do not truly contribute to LLMs' predictions. We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS). We reveal that LLMs often interleave true-thinking steps (which are genuinely used to produce the final output) with decorative-thinking steps (which only give the appearance of reasoning but have minimal causal impact). Notably, only a small subset of the reasoning steps has a high TTS and causally drives the model's prediction: e.g., for the AIME dataset, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) under the Qwen-2.5 model. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along or against this direction, we can force the model to perform or disregard certain CoT steps when computing the final result. Finally, we highlight that self-verification steps in CoT (i.e., aha moments) can also be decorative, in which case LLMs do not truly verify their solution. Steering along the TrueThinking direction can force internal reasoning over these steps, resulting in a change in the final results. Overall, our work reveals that LLMs often verbalize reasoning steps without actually performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
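To make the "2.3% of steps with TTS >= 0.7" statistic concrete, the following is a minimal sketch of how such a figure could be computed from per-step scores. It assumes the percentage is a per-CoT fraction averaged over CoTs, which the abstract does not spell out; the function names and the example scores are hypothetical and are not taken from the paper.

```python
def high_tts_fraction(step_scores, threshold=0.7):
    """Fraction of steps in one CoT whose score is >= threshold."""
    if not step_scores:
        return 0.0
    return sum(1 for s in step_scores if s >= threshold) / len(step_scores)


def mean_high_tts_fraction(cots, threshold=0.7):
    """Average the per-CoT fraction over all non-empty CoTs.

    Assumption: the reported percentage is a mean of per-CoT fractions,
    not a pooled fraction over all steps in the dataset.
    """
    fractions = [high_tts_fraction(c, threshold) for c in cots if c]
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hypothetical per-step scores in [0, 1] for two CoTs (illustrative only).
example_cots = [
    [0.9, 0.1, 0.0, 0.2],         # 1 of 4 steps at or above 0.7
    [0.05, 0.75, 0.8, 0.1, 0.0],  # 2 of 5 steps at or above 0.7
]
average_fraction = mean_high_tts_fraction(example_cots)
```

A pooled variant (counting all high-scoring steps and dividing by the total number of steps) would weight long CoTs more heavily, so the two conventions can give different numbers on the same data.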

Paper Structure

This paper contains 36 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: We find that reasoning steps in CoT are not always true thinking; they may instead function as decorative thinking, where the model is not internally using those steps to compute its answer. Taking self-verification steps as an example (known as "aha moments", where LLMs rethink their solution with phrases like "wait"), we randomly perturb the numerical values in the reasoning steps preceding the aha moment and then re-prompt the model for the answer using the modified CoT. In the left example, although the model’s self-verification reasoning is correct, the model ignores that reasoning and outputs the wrong answer after perturbation. In the right example, the model follows its self-verification and produces the correct result.
  • Figure 2: (a) Illustration of different modes in thinking steps within chain-of-thought (CoT) reasoning. Contrary to the naive view that a step’s faithfulness depends solely on whether perturbing it directly changes the final result, we show that the relationship is more nuanced. A true thinking step may operate in either an AND or OR mode when interacting with other steps. In both cases, such steps contribute meaningfully to the final answer. (b) Based on this understanding, we define the True Thinking Score, which jointly considers two complementary evaluations: the necessity test (high for AND-like steps) and the sufficiency test (high for OR-like steps).
  • Figure 3: We uncover the TrueThinking direction in LLMs which is extracted as the difference between the mean hidden states of true-thinking steps and decorative-thinking steps. Steering the hidden states of each token in a step along this direction induces the model to truly think over that step in latent space.
  • Figure 4: (a) The dataset-level distribution of the TTS; (b) The distribution of ATE($c=1$) and ATE($c=0$), where "low" means ATE($\cdot$) is below the mean and "high" means ATE($\cdot$) is above the mean; (c) An example CoT annotated with TTS, and the average TTS at different (normalized) step percentiles.
  • Figure 5: An example of unfaithful self-verification steps (highlighted in blue) in which the TTS of each step is below 0.005. A low TTS indicates that those steps are not truly engaged in computation; rather, these reasoning steps are likely decorative, giving the appearance of self-verification while contributing minimally to the model's final prediction.
  • ...and 5 more figures
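
The Figure 3 caption describes the TrueThinking direction as the difference between the mean hidden states of true-thinking steps and decorative-thinking steps, applied to every token of a step. A minimal sketch of that difference-of-means construction follows. It uses plain Python lists in place of real model activations; the function names, the scaling factor `alpha`, and the toy vectors are assumptions made for illustration, and the choice of layer and any normalization used in the paper are not stated in this excerpt.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def truethinking_direction(true_states, decorative_states):
    """Difference of means: mean(true-thinking) - mean(decorative)."""
    mu_true = mean_vector(true_states)
    mu_decorative = mean_vector(decorative_states)
    return [t - d for t, d in zip(mu_true, mu_decorative)]


def steer(step_token_states, direction, alpha=1.0):
    """Shift the hidden state of every token in a step by alpha * direction.

    Assumption: alpha > 0 steers along the direction and alpha < 0 steers
    against it; the scale used in the paper is not given here.
    """
    return [
        [h + alpha * d for h, d in zip(token_state, direction)]
        for token_state in step_token_states
    ]


# Toy 2-dimensional "hidden states" (hypothetical values).
true_states = [[1.0, 2.0], [3.0, 4.0]]        # mean is [2.0, 3.0]
decorative_states = [[0.0, 1.0], [2.0, 1.0]]  # mean is [1.0, 1.0]
direction = truethinking_direction(true_states, decorative_states)

step_tokens = [[0.0, 0.0], [1.0, 1.0]]
steered_along = steer(step_tokens, direction, alpha=0.5)
steered_against = steer(step_tokens, direction, alpha=-0.5)
```

In a real model the shifted states would be written back into the forward pass at the chosen layer, for example through a hook; that part depends on the model library and is left out of this sketch.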