Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors
Viacheslav Sinii, Nikita Balagansky, Gleb Gerasimov, Daniil Laptev, Yaroslav Aksenov, Vadim Kurochkin, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov
TL;DR
The paper investigates how reinforcement-learning-based reasoning training reshapes the internal computations of LLMs by injecting lightweight steering vectors $s_\ell$ into the residual stream of frozen base models. Using mechanistic interpretability tools (logit-lens, path-patching) across two math-capable models, it reveals that final-layer steering acts as a first-token substitution bias, while penultimate-layer steering primarily modulates the MLP and mid-layers suppress non-English tokens. Steering vectors are composable and transferable across related models, and adaptive token-wise magnitude further sharpens activation patterns. DiffSAE-based analyses connect steering-induced activation changes to features correlated with correct generations, and transfer experiments show directional alignment of latent vectors within model families. The work demonstrates that small, additive perturbations can reproduce much of the gains of full fine-tuning, offering concrete insights for activation engineering and a deeper understanding of reasoning processes in LLMs. These findings have practical significance for designing targeted interventions that shape reasoning without full retraining, while also highlighting model- and template-specific differences that warrant further cross-architecture study.
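As a reading aid, the additive intervention described above can be written as follows. This is a sketch of one common formulation, not necessarily the paper's exact parameterization; $h_{\ell,t}$ (the residual-stream state at layer $\ell$ and token position $t$) and $\alpha_{\ell,t}$ (a token-wise scale) are notation assumed here.

$$h_{\ell,t}' = h_{\ell,t} + s_\ell \qquad \text{(fixed magnitude)}$$

$$h_{\ell,t}' = h_{\ell,t} + \alpha_{\ell,t}\, s_\ell \qquad \text{(adaptive token-wise magnitude)}$$

Under this reading, only $s_\ell$ (and, in the adaptive case, whatever produces the scale) would be trained, while the base model's weights stay frozen.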
Abstract
The mechanisms by which reasoning training reshapes LLMs' internal computations remain unclear. We study lightweight steering vectors inserted into the base model's residual stream and trained with a reinforcement-learning objective. These vectors match full fine-tuning performance while preserving the interpretability of small, additive interventions. Using logit-lens readouts and path-patching analyses on two models, we find that (i) the last-layer steering vector acts like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as "To" and "Step"; (ii) the penultimate-layer vector leaves attention patterns largely intact and instead operates through the MLP and unembedding, preferentially up-weighting process words and structure symbols; and (iii) middle layers de-emphasize non-English tokens. Next, we show that a sparse autoencoder (SAE) isolates features associated with correct generations. We also show that steering vectors (i) transfer to other models, (ii) combine across layers when trained in isolation, and (iii) concentrate magnitude on meaningful prompt segments under adaptive token-wise scaling. Taken together, these results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.
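To make the last-layer finding concrete, the following minimal NumPy sketch shows why a vector added to the final residual stream behaves like a fixed bias on the output logits. It is a toy illustration under stated assumptions, not the paper's implementation: the helper names `add_steering` and `logit_lens`, the tiny dimensions, and the random unembedding matrix `W_U` are all hypothetical, and the final layer norm that real models apply before unembedding is omitted, so the exact linearity shown here would only hold approximately in practice.

```python
import numpy as np


def add_steering(resid, s, alpha=None):
    """Add a steering vector to every token position of a residual stream.

    resid: (seq_len, d_model) activations; s: (d_model,) steering vector;
    alpha: optional (seq_len,) per-token scales (assumed form of adaptive scaling).
    """
    if alpha is None:
        return resid + s
    return resid + alpha[:, None] * s


def logit_lens(vec, W_U):
    """Project residual-stream vector(s) through the unembedding W_U (d_model, vocab)."""
    return vec @ W_U


rng = np.random.default_rng(0)
d_model, vocab, seq_len = 8, 5, 3

# Toy unembedding with unit-norm columns, so the column most aligned
# with a given column is that column itself.
W_U = rng.normal(size=(d_model, vocab))
W_U /= np.linalg.norm(W_U, axis=0, keepdims=True)

resid = rng.normal(size=(seq_len, d_model))

# Hypothetical steering vector pointing along token 2's unembedding direction.
s = 3.0 * W_U[:, 2]

# Without a final norm, steering shifts every position's logits by the same offset.
steered = add_steering(resid, s)
delta_logits = logit_lens(steered, W_U) - logit_lens(resid, W_U)

# Logit-lens readout of the steering vector itself: which token does it promote?
promoted_token = int(np.argmax(logit_lens(s, W_U)))

# Token-wise scaling: position 0 is untouched, position 2 gets twice the push.
alpha = np.array([0.0, 1.0, 2.0])
steered_adaptive = add_steering(resid, s, alpha)
```

In this toy setup the logit shift is identical at every position and equals the logit-lens readout of `s`, which is the sense in which a last-layer vector can act as a token-substitution bias. The `alpha` argument only illustrates the general shape of token-wise scaling; how the scales are actually computed is not specified here.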
