The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

Yihuai Hong, Dian Zhou, Meng Cao, Lei Yu, Zhijing Jin

TL;DR

This work identifies Linear Reasoning Features (LiReFs): linear directions in the residual stream of decoder-only transformers that mediate the balance between reasoning and memorization in LLMs. LiReFs are extracted as a difference of means between activations on reasoning-oriented and memory-oriented inputs, which enables both diagnostic visualization and causal intervention: at inference time, adding or ablating the LiReF direction, scaled by a scalar α, shifts the model toward more generalizable reasoning or toward memorization. Across four base models and six datasets, LiReFs consistently separate reasoning from memory tasks and correlate with reasoning generalizability (e.g., Spearman coefficients of 0.840 for LLaMA3-8B-base and 0.752 for Mistral-7B-v0.3-base between GPT-4o reasoning scores and projection values along the LiReF direction). Inference-time LiReF interventions improve accuracy on reasoning tasks and reduce the misapplication of memory-based approaches, suggesting a mechanistic, transferable control knob for robust and interpretable generative reasoning in LLMs. The results point to a principled path toward more predictable and controllable AI systems that leverage the internal geometry of activation spaces.
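The extraction and intervention steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the direction is the mean reasoning activation minus the mean memory activation at one layer, that "adding" means h + α·r̂ with r̂ the unit-normalized direction, and that "ablating" means projecting the direction out of h. The helper names (`diff_of_means`, `steer`, `ablate`) and the toy three-dimensional activations are hypothetical, and plain Python lists stand in for real residual-stream tensors.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means(reasoning_acts, memory_acts):
    """Candidate direction: mean(reasoning) - mean(memory)."""
    mu_r = mean_vector(reasoning_acts)
    mu_m = mean_vector(memory_acts)
    return [a - b for a, b in zip(mu_r, mu_m)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(h, direction, alpha):
    """Assumed additive intervention: h + alpha * r_hat."""
    r_hat = unit(direction)
    return [x + alpha * r for x, r in zip(h, r_hat)]

def ablate(h, direction):
    """Assumed ablation: remove the component of h along r_hat."""
    r_hat = unit(direction)
    p = dot(h, r_hat)
    return [x - p * r for x, r in zip(h, r_hat)]

# Toy activations (hypothetical): two reasoning prompts, two memory prompts.
reasoning_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
memory_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

liref = diff_of_means(reasoning_acts, memory_acts)
h = [1.0, 2.0, 3.0]
steered = steer(h, liref, alpha=2.0)   # pushed along the direction
ablated = ablate(h, liref)             # component along the direction removed
```

In this sketch a positive α moves the hidden state toward the reasoning side and a negative α toward the memory side; in a real model the same arithmetic would be applied to residual-stream activations inside a forward hook at the chosen layer.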

Abstract

Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs' reasoning-memorization dynamics by identifying a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems.

Paper Structure

This paper contains 34 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Main findings of our study: (a) There exists a set of linear features (LiReFs) in the LLM residual stream that drives the model to switch between reasoning and memorization modes with different levels of generalizability. (b) LiReFs generally explain model reasoning capability across various knowledge domains and languages. (c) Model activation values along LiReFs correlate strongly with model generalizability on reasoning tasks. (d) Intervening on LiReFs at inference time can further improve the model's reasoning performance and generalizability.
  • Figure 2: Visualization of the hidden states of four base models using 2-dimensional PCA. For each model, we plot six groups of points across several datasets. We observe that: (1) For all four models, Reasoning-required and Memory-required questions separate naturally into two distinct groups, as shown by the boundary (grey dashed line) fitted via logistic regression, with the blue arrows showing the approximate direction of the Linear Reasoning Features. (2) In the extracted dimensions, task domain and language have no significant influence on the distribution within a category: data requiring the same capability cluster together in the same region.
  • Figure 3: Layerwise cosine similarity between the last token residual stream activations and the extracted Linear Reasoning Features (LiReFs) in four base models and their corresponding instruction-tuned variants.
  • Figure 4: Strong correlation between Projection Values on the Linear Reasoning Features (LiReFs) direction and the Reasoning Score provided by GPT-4o, with Spearman coefficients of 0.840 (LLaMA3-8B-base) and 0.752 (Mistral-7B-v0.3-base). The LiReFs projections exhibit a spectrum-like distribution, where continuous increases in Reasoning Scores correspond to progressively rising Projection Values along the LiReFs direction.
  • Figure 5: Visualization of the hidden states of two base models on the datasets of MBPP, HumanEval, MMLU-Pro-M and MMLU-Pro-R using 2-dimensional PCA. The hidden states of coding tasks, which involve both reasoning and memory recall, are positioned around the boundary (grey dashed line) fitted via logistic regression.
  • ...and 5 more figures
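
The quantities named in the captions of Figures 3 and 4 (cosine similarity between a last-token activation and the LiReF direction, the projection value along that direction, and a Spearman coefficient against reasoning scores) can be sketched in a few lines. This is an illustrative sketch under assumptions, not the paper's code: the hidden states, direction, and scores below are made-up toy values, the function names are hypothetical, and Spearman's coefficient is computed as the Pearson correlation of average ranks.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return dot(u, v) / (norm(u) * norm(v))

def projection(h, direction):
    """Scalar projection of h onto the unit-normalized direction."""
    return dot(h, direction) / norm(direction)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Toy data (hypothetical): 2-d last-token hidden states and one score each.
direction = [1.0, 0.0]
hidden_states = [[0.5, 1.0], [1.5, -1.0], [3.0, 0.2], [2.0, 0.0]]
reasoning_scores = [0.1, 0.4, 0.5, 0.9]

projections = [projection(h, direction) for h in hidden_states]
similarities = [cosine(h, direction) for h in hidden_states]
rho = spearman(projections, reasoning_scores)
```

Repeating the cosine computation at every layer would give a layerwise curve of the kind Figure 3 describes, and `rho` plays the role of the coefficients reported in Figure 4.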