Who Reasons in the Large Language Models?

Jie Shao, Jianxin Wu

TL;DR

The paper tackles the question of where reasoning emerges in large language models and proposes that the output projection $o_{proj}$ within the Transformer’s MHSA is the key driver. It introduces Stethoscope for Networks (SfN), a diagnostic framework with Delta, Merge, Freeze, and Destruction gadgets to localize reasoning to $o_{proj}$ and distinguish it from conversational capabilities governed by other modules. Empirical evidence includes per-module weight shifts $\|w_X(B) - w_X(A)\|_2$ dominated by $o_{proj}$, a bimodal distribution for $o_{proj}$, and successful level IV reasoning when merging $o_{proj}$ from reasoning models into base models (e.g., on AIME), while other module merges often degrade performance. The work discusses implications for faster, modular, domain-specific LLMs via targeted $o_{proj}$ finetuning and cautions about limitations, generalization, and risks of targeted manipulation, marking a step toward more interpretable and efficient LLM design.
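The per-module weight shift mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the parameter names follow a common LLaMA/Qwen-style convention (an assumption; only `o_proj` is named in the source), weights are toy flat lists rather than real tensors, and pooling all layers into a single norm per module type is also an assumption.

```python
import math
from collections import defaultdict

# Hypothetical module names; only o_proj is named in the source.
MODULES = ("q_proj", "k_proj", "v_proj", "o_proj",
           "gate_proj", "up_proj", "down_proj")


def module_of(param_name):
    """Return the module type embedded in a parameter name, or None."""
    for m in MODULES:
        if f".{m}." in param_name:
            return m
    return None


def per_module_l2(weights_a, weights_b):
    """Compute ||w_X(B) - w_X(A)||_2 for each module type X.

    Both arguments map parameter names to flat lists of floats (toy
    stand-ins for tensors). Squared differences are pooled over all
    layers before the square root, which is an assumption of this sketch.
    """
    squared = defaultdict(float)
    for name, wa in weights_a.items():
        m = module_of(name)
        if m is None:
            continue  # skip parameters outside the listed modules
        wb = weights_b[name]
        squared[m] += sum((b - a) ** 2 for a, b in zip(wa, wb))
    return {m: math.sqrt(s) for m, s in squared.items()}


# Toy example: only o_proj differs between model A and model B.
weights_a = {
    "layers.0.self_attn.q_proj.weight": [1.0, 2.0],
    "layers.0.self_attn.o_proj.weight": [0.0, 0.0],
    "layers.1.self_attn.o_proj.weight": [1.0, 1.0],
}
weights_b = {
    "layers.0.self_attn.q_proj.weight": [1.0, 2.0],
    "layers.0.self_attn.o_proj.weight": [3.0, 0.0],
    "layers.1.self_attn.o_proj.weight": [1.0, 5.0],
}
shifts = per_module_l2(weights_a, weights_b)
```

In this toy case `shifts["o_proj"]` is 5.0 (the square root of 9 + 16) and `shifts["q_proj"]` is 0.0, so ranking modules by `shifts` would put `o_proj` first.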

Abstract

Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities--such as mathematical reasoning--remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model, specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in well-trained LLMs are primarily attributed to the output projection module (o_proj) in the Transformer's multi-head self-attention (MHSA) mechanism. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that o_proj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.

Paper Structure

This paper contains 17 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Stethoscope for Networks. SfN is a framework designed to identify which components of an LLM give rise to specific abilities. By comparing weight changes and observing behaviors under controlled module merging, tuning, or destruction, SfN provides interpretable insights into the origin of capabilities like reasoning.
  • Figure 1: AIME 2024 accuracy of the base model, the reasoning model, and their merged variants. Each merged model is constructed by replacing specific modules in model $A$ with the corresponding module from model $B$.
  • Figure 2: Per-module L2 distance of linear weights between models $A$ and $B$. Notably, the o_proj module shows the second-largest change in 1.5B models, and the largest in 14B and 32B models, highlighting its potential importance for reasoning. Similar trends are observed in 7B and 8B models (see appendix).
  • Figure 3: Layer-wise distribution of relative weight changes between models $A$ and $B$. While most modules display a unimodal distribution, the o_proj module uniquely exhibits a bimodal distribution, highlighting its distinctive behavior. Consistent patterns are observed across models of other sizes, with detailed results provided in the appendix.
  • Figure 4: Four levels of responses generated by the LLM. From level I to level IV, the model exhibits stronger language organization and logical reasoning skills. Each example includes a question (e.g., a math problem from AIME or a typical user-issued request) and the corresponding response generated by the LLM.
  • ...and 4 more figures
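The module replacement described in the AIME 2024 caption (taking model $A$ and swapping in selected modules from model $B$) might look roughly like the sketch below. It is illustrative only: the parameter names are hypothetical, weights are toy lists, and matching modules by substring is an assumption rather than the paper's procedure.

```python
def merge_modules(weights_a, weights_b, modules):
    """Return a copy of A's weights with the chosen modules taken from B.

    `modules` is a collection of module names such as ("o_proj",).
    Matching on ".<name>." inside the parameter name is an assumption
    about the naming scheme.
    """
    merged = dict(weights_a)  # shallow copy; A's own dict is not modified
    for name in weights_a:
        if any(f".{m}." in name for m in modules):
            merged[name] = weights_b[name]
    return merged


# Toy stand-ins: A plays the base model, B the reasoning model.
base = {
    "layers.0.self_attn.q_proj.weight": [1.0, 2.0],
    "layers.0.self_attn.o_proj.weight": [0.0, 0.0],
}
reasoning = {
    "layers.0.self_attn.q_proj.weight": [9.0, 9.0],
    "layers.0.self_attn.o_proj.weight": [3.0, 4.0],
}
merged = merge_modules(base, reasoning, modules=("o_proj",))
```

Here `merged` keeps the base model's `q_proj` weights and carries the reasoning model's `o_proj` weights, while `base` itself is left unchanged. Evaluating such a merged model on a benchmark is outside the scope of this sketch.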

Theorems & Definitions (2)

  • Conjecture 1: Division of Labor
  • Conjecture 2: Output Projection Plugin