Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

Viacheslav Sinii, Nikita Balagansky, Gleb Gerasimov, Daniil Laptev, Yaroslav Aksenov, Vadim Kurochkin, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov

TL;DR

The paper investigates how reinforcement-learning–based reasoning training reshapes internal computations of LLMs by injecting lightweight steering vectors $s_\ell$ into the residual stream of frozen base models. Using mechanistic interpretability tools (logit-lens, path-patching) across two math-capable models, it reveals that final-layer steering acts as a first-token substitution bias while penultimate-layer steering primarily modulates the MLP, with mid-layers suppressing non-English tokens; steering vectors are composable and transferable across related models, and adaptive token-wise magnitude further sharpens activation patterns. DiffSAE-based analyses connect steering-induced activation changes to features correlated with correct generations, and transfer experiments show directional alignment of latent vectors within model families. The work demonstrates that small, additive perturbations can reproduce much of the gains of full fine-tuning, offering concrete insights for activation engineering and deeper understanding of reasoning processes in LLMs. These findings hold practical significance for designing targeted interventions to shape reasoning without full retraining, while also highlighting model- and template-specific differences that warrant further cross-architecture study.

Abstract

The mechanisms by which reasoning training reshapes LLMs' internal computations remain unclear. We study lightweight steering vectors inserted into the base model's residual stream and trained with a reinforcement-learning objective. These vectors match full fine-tuning performance while preserving the interpretability of small, additive interventions. Using logit-lens readouts and path-patching analyses on two models, we find that (i) the last-layer steering vector acts like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as "To" and "Step"; (ii) the penultimate-layer vector leaves attention patterns largely intact and instead operates through the MLP and unembedding, preferentially up-weighting process words and structure symbols; and (iii) middle layers de-emphasize non-English tokens. Next, we show that an SAE isolates features associated with correct generations. We also show that steering vectors (i) transfer to other models, (ii) combine across layers when trained in isolation, and (iii) concentrate magnitude on meaningful prompt segments under adaptive token-wise scaling. Taken together, these results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.
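
To make the "token-substitution bias" reading of the last-layer vector concrete, here is a minimal toy sketch in plain Python. It assumes a purely linear readout, so it ignores the final normalization that real models apply before unembedding. The three-token vocabulary, the unembedding matrix, the hidden state, and the steering vector are all made-up illustrative values, not quantities from the paper. Under that assumption, adding a vector to the last residual state shifts every logit by a fixed amount that does not depend on the hidden state, which is why such a vector can behave like a per-token bias.

```python
def matvec(matrix, vec):
    """Multiply a (rows x d) matrix by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def logit_lens(hidden, unembed):
    """Toy logit-lens readout: project a residual state onto the vocabulary.

    Assumption: no final normalization, so the readout is linear.
    """
    return matvec(unembed, hidden)


# Hypothetical toy values (not taken from the paper).
vocab = ["To", "Step", "The"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per token
hidden = [0.2, -0.1]   # last-layer residual state for one position
steer = [0.5, 0.25]    # additive steering vector for the last layer

base_logits = logit_lens(hidden, unembed)
steered_logits = logit_lens(add(hidden, steer), unembed)

# Change in logits caused by steering.
logit_shift = [s - b for s, b in zip(steered_logits, base_logits)]

# Reading out the steering vector alone gives the same shift:
# with a linear readout it is a hidden-state-independent bias.
bias = logit_lens(steer, unembed)

most_boosted = vocab[max(range(len(vocab)), key=lambda i: bias[i])]
```

In this toy setup `most_boosted` is the token whose logit the steering vector raises the most; in a real model the same readout would be applied to the trained vector with the model's own unembedding matrix.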

Paper Structure

This paper contains 34 sections, 10 equations, 26 figures, 7 tables.

Figures (26)

  • Figure 1: Single-layer steering. Mean accuracy on six benchmarks for Qwen2.5-Math-7B when training a single vector $s_\ell$ at layer $\ell$ with all other layers frozen. Mid-layer vectors yield the largest gains but never match all-layer steering, indicating the improvement is distributed across layers.
  • Figure 2: Steering Vector Persistence. For each steering layer $i$ (color encodes $i$; warm = early, cool = late) and each target layer $\ell$ on the $x$-axis, we compute the mean cosine similarity of the per-token change in hidden representations $\Delta F_{<\ell,i}$. Left: similarity between $\Delta F_{<\ell,i}(x)$ and the dataset mean $\mathbb{E}_x[\Delta F_{<\ell,i}(x)]$, showing how aligned the per-token shifts are. Right: similarity between $\Delta F_{<\ell,i}(x)$ and the layer-$\ell$ steering vector $s_\ell$, showing the alignment of the shifts with the layer's own steering vector.
  • Figure 3: Last-layer analysis. Left: the last-layer vector mainly boosts the initial token "To". Right: prefixing that token reproduces most of the observed performance gain.
  • Figure 4: Similarity of steering-induced unembedding biases. Each cell shows the cosine similarity between the average final-layer shifts $\mathbb{E}[\Delta F_{<L,i}]$ and $\mathbb{E}[\Delta F_{<L,j}]$ induced by steering at layers $i$ and $j$. High similarity across $i,j<L$ indicates a shared effect on the unembedding regardless of where steering is applied. The last-layer shift implements another mechanism.
  • Figure 5: Penultimate-layer steering in Qwen2.5-Math-7B. Mean accuracy when injecting $s_{26}$ into a single projection of the final block: $Q$ (left), $K$ (center), $V$ (right). Placing $s_{26}$ only in $V_1$ closes the gap between Skip-Attn and $s_{26}$, indicating the effect is carried by the $V_1\!\to\!W^O$ path and is largely independent of $Q/K$ and attention weights.
  • ...and 21 more figures
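
The two quantities described in the Figure 2 caption can be sketched as follows. This is a toy illustration in plain Python, not the authors' code: the per-token shifts stand in for $\Delta F_{<\ell,i}(x)$ and `steering_vector` stands in for $s_\ell$, and both hold made-up two-dimensional values. The helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mean_vector(vectors):
    """Coordinate-wise mean of a list of vectors (the dataset-mean shift)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mean_cosine_to(reference, vectors):
    """Average cosine similarity between each vector and a reference vector."""
    return sum(cosine(v, reference) for v in vectors) / len(vectors)


# Hypothetical per-token shifts in hidden state at one target layer,
# i.e. steered activations minus unsteered activations.
shifts = [[1.0, 0.0], [1.0, 1.0], [1.0, -1.0]]
steering_vector = [0.0, 1.0]  # hypothetical steering vector at the target layer

# Left panel analogue: how aligned the per-token shifts are with their mean.
alignment_with_mean = mean_cosine_to(mean_vector(shifts), shifts)

# Right panel analogue: how aligned the shifts are with the layer's own vector.
alignment_with_steering = mean_cosine_to(steering_vector, shifts)
```

A value of `alignment_with_mean` near 1 would mean the steering effect is roughly the same direction for every token, while `alignment_with_steering` measures whether an upstream vector pushes activations along the direction that the target layer's own vector would add.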