Implicit Reasoning in Transformers is Reasoning through Shortcuts

Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang

TL;DR

This paper asks why implicit reasoning in transformers fails to develop the advanced multi-step capabilities seen with explicit chain-of-thought (CoT), and whether step-by-step reasoning can emerge internally without it. Training a RoPE-enhanced GPT-2 on a synthetic multi-step arithmetic dataset and analyzing it with activation patching, the authors show that stepwise reasoning emerges when the training data follows a fixed pattern. When premise order is unfixed, the model instead learns shortcuts such as number-chaining, and generalization collapses. State-of-the-art LLMs show the same reliance on shortcuts under unfixed patterns, which reveals a fundamental limitation of current implicit reasoning. The work highlights the need for training regimes and architectures that promote genuine variable tracking and robust, order-agnostic reasoning.

Abstract

Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.
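The shortcut learning mentioned in the abstract can be illustrated with a toy example. The sketch below is a hypothetical simplification, not the authors' dataset or model: it assumes each premise has the form `target = left op right` over plain integers, that every premise after the first contains exactly one variable and one number, and that the shortcut folds numbers into a running total while ignoring which side of the operator the variable sits on. The names `Step`, `solve_by_tracking`, and `solve_by_chaining` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Union

Operand = Union[int, str]  # an int literal or a variable name


@dataclass(frozen=True)
class Step:
    """One premise of the assumed form: target = left op right."""
    target: str
    left: Operand
    op: str  # "+" or "-"
    right: Operand


def apply(op: str, a: int, b: int) -> int:
    return a + b if op == "+" else a - b


def solve_by_tracking(steps, query):
    """Reference solver: looks up each variable's value, so operand order matters."""
    defs = {s.target: s for s in steps}

    def value(x):
        if isinstance(x, int):
            return x
        s = defs[x]
        return apply(s.op, value(s.left), value(s.right))

    return value(query)


def solve_by_chaining(steps, query):
    """Assumed shortcut: chain the numbers and ignore where the variable stands."""
    defs = {s.target: s for s in steps}
    chain = []
    name = query
    while name is not None:
        s = defs[name]
        chain.append(s)
        # Follow the single variable operand back toward the first premise.
        name = next((o for o in (s.left, s.right) if isinstance(o, str)), None)
    chain.reverse()
    first = chain[0]  # the premise with two numbers
    total = apply(first.op, first.left, first.right)
    for s in chain[1:]:
        number = s.left if isinstance(s.left, int) else s.right
        total = apply(s.op, total, number)  # always "running total op number"
    return total


first = Step("a", 3, "+", 5)
commuted = [Step("b", 4, "+", "a"), first]     # premises out of order, b = 4 + a
subtrahend = [Step("b", 10, "-", "a"), first]  # variable as subtrahend, b = 10 - a

print(solve_by_tracking(commuted, "b"), solve_by_chaining(commuted, "b"))
print(solve_by_tracking(subtrahend, "b"), solve_by_chaining(subtrahend, "b"))
```

Under these assumptions the two solvers agree on `b = 4 + a` (both give 12) because addition is commutative, but disagree on `b = 10 - a`: tracking gives 2, while chaining computes 8 - 10 = -2.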

Paper Structure

This paper contains 39 sections, 1 equation, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: A failure of generalization in language models trained on data with unfixed patterns, namely "Variable as Subtrahend Plight". When trained on unfixed premise order, the model learns a reasoning shortcut that benefits from addition commutativity. This shortcut enables the model to perform implicit reasoning by chaining numbers, which fails when variables are subtrahends.
  • Figure 2: Test accuracy during the training stage. We find that Transformers are able to learn to reason implicitly and generalize well to problems that require more reasoning steps.
  • Figure 3: Activation patching on residual stream across layers and token positions when changing the first number in the problems. All the premise orders are forward.
  • Figure 4: Test accuracy under different attention window sizes on 5-step problems. A window size of $n$ means that a token can focus on itself and its preceding $n-1$ tokens.
  • Figure 5: Patching effect of different components across layers and token positions. We change the numbers in the first two steps. The result of step 2 is changed in sub-figures (a), (b), and (c), while it is kept unchanged in sub-figures (d), (e), and (f). A deeper color indicates a more significant activation at that position. A green rectangle marks the location where the patching effect first starts to diminish.
  • ...and 13 more figures
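
The attention-window definition in the Figure 4 caption (a window of size n lets a token attend to itself and its preceding n-1 tokens) can be written as a boolean mask. This is a minimal sketch in plain Python, not the authors' implementation; the function name `sliding_window_causal_mask` and the list-of-lists representation are assumptions made for illustration.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Position i sees itself and the window - 1 positions before it,
    i.e. all j with 0 <= i - j < window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Print a small mask: "x" marks an allowed attention link, "." a blocked one.
for row in sliding_window_causal_mask(seq_len=4, window=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With `window=1` each token sees only itself, and with `window >= seq_len` the mask reduces to ordinary causal attention.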