Understanding Reasoning in Thinking Language Models via Steering Vectors

Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda

TL;DR

The paper addresses how to control internal reasoning in thinking LLMs by introducing steering vectors, which are directions in activation space. It defines a causal framework using Activation Patching and the Difference of Means to extract vectors that modulate behaviors such as backtracking, uncertainty estimation, and example testing, with validation across 500 tasks and three model sizes within the DeepSeek-R1-Distill family. The authors demonstrate interpretable, inference-time control by adding steering vectors to, or subtracting them from, residual activations, supported by attribution patching for layer selection and consistent behavior shifts across models. These findings advance fine-grained, task-adaptive manipulation of thinking processes in LLMs and highlight both practical utility and avenues for future work on generalization and robustness, including annotation reliability and broader model applicability.

Abstract

Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model's reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.
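The idea that a behavior corresponds to a linear direction can be illustrated with a minimal NumPy sketch. It computes a difference-of-means vector from two sets of activations and then adds it to, or subtracts it from, a residual activation. The activations here are synthetic, and the names `difference_of_means`, `steer`, and `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def difference_of_means(pos_acts, neg_acts):
    """Mean activation on behavior examples minus mean activation on the rest."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def steer(resid, vector, alpha):
    """Add (alpha > 0) or subtract (alpha < 0) a steering vector."""
    return resid + alpha * vector


# Synthetic stand-in for residual-stream activations (assumption: d_model = 16).
rng = np.random.default_rng(0)
d_model = 16
direction = np.zeros(d_model)
direction[3] = 1.0  # hypothetical "behavior" direction planted in the data

neg = rng.normal(size=(200, d_model))                    # behavior absent
pos = rng.normal(size=(200, d_model)) + 2.0 * direction  # behavior present

v = difference_of_means(pos, neg)

resid = rng.normal(size=d_model)
up = steer(resid, v, alpha=1.0)     # positive steering
down = steer(resid, v, alpha=-1.0)  # negative steering
```

In this toy setting the recovered vector `v` is dominated by the planted coordinate, and positive steering moves the activation further along `v` while negative steering moves it back. In a real model the two activation sets would come from annotated token positions, and the steering strength would need tuning per layer and per behavior.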

Paper Structure

This paper contains 21 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Steering on DeepSeek-R1's backtracking feature vector changes the model's behavior. Depending on whether we add this vector to the activations or subtract it from them at inference time, the model increases or decreases its tendency to abandon its current approach and explore alternative strategies for the task at hand. Highlighted sections indicate instances of this behavior.
  • Figure 2: Comparison of behavioral patterns across five DeepSeek-R1-Distill models and five baseline models on $100$ randomly selected tasks from our dataset (cf. the experimental setup subsection). The plot on the left shows the fraction of sentences annotated with each behavioral category. The plot on the right shows the average number of sentences per response. Thinking models generate substantially longer responses ($27.6$ vs. $14.4$ sentences on average) and exhibit higher fractions of backtracking, uncertainty estimation, and example testing behaviors, but lower fractions of knowledge augmentation.
  • Figure 3: Causal impact of candidate steering vectors across model layers. The y-axis represents the absolute mean KL-divergence for the next-token logit distribution when removing the steering vector at each layer. The steering vectors for all reasoning mechanisms have similar peaks in the middle layers of the respective models.
  • Figure 4: Effect of applying the steering vector for each reasoning behavior across different distill models. The y-axis shows the change in the fraction of tokens exhibiting each behavior when applying positive or negative steering. Positive steering increases behaviors such as backtracking and uncertainty estimation, while negative steering suppresses or significantly reduces them, confirming the causal influence of our extracted vectors.
  • Figure 5: Cosine similarity heatmaps between steering vectors for different reasoning behaviors. The heatmaps show pairwise similarities between feature vectors extracted for five behavioral categories. Values range from -1 (completely opposite) to 1 (identical direction), with colors indicating the strength and direction of similarity. Most behaviors show low to moderate similarities, indicating they represent distinct reasoning mechanisms in the model's activation space.
  • ...and 1 more figure
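The two quantities named in the captions for Figures 3 and 5 can be sketched in NumPy: a KL divergence between next-token distributions with and without a steering direction, and pairwise cosine similarity between steering vectors. This is a toy illustration under stated assumptions. "Removing" the vector is modeled here as projecting the direction out of a single activation, and a random matrix `W_U` stands in for the rest of the model. Neither choice is taken from the paper, which uses attribution patching for layer selection.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_divergence(logits_p, logits_q):
    """Mean KL(P || Q) between next-token distributions given as logits."""
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


rng = np.random.default_rng(0)
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))  # hypothetical stand-in for the model

# Hypothetical steering vectors, one row per behavior.
steering = rng.normal(size=(3, d_model))
v_hat = steering[0] / np.linalg.norm(steering[0])

resid = rng.normal(size=d_model)
ablated = resid - (resid @ v_hat) * v_hat  # project the direction out

kl = kl_divergence(resid @ W_U, ablated @ W_U)
kl_same = kl_divergence(resid @ W_U, resid @ W_U)
sims = cosine_similarity_matrix(steering)
```

A larger `kl` for a given layer would indicate that the direction matters more for the next-token prediction at that layer, and off-diagonal entries of `sims` near zero would indicate that two behaviors occupy largely distinct directions.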