Patches of Nonlinearity: Instruction Vectors in Large Language Models
Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych
TL;DR
This work investigates how instructions are represented inside instruction-tuned LLMs by locating Instruction Vectors (IVs) in the residual stream after the final instructional token $T_{\text{inst}}$. It finds that IVs are localized and linearly separable by task semantics, yet engage in nonlinear, superadditive interactions that cannot be captured by additive causal graphs. To study this, the authors develop an intervention-free, locally linear surrogate framework for tracing information flow in transformers and demonstrate that IVs function as circuit selectors that guide distinct information pathways in later layers. They validate across base, SFT, and DPO variants on a suite of eight tasks including BigBench, with IA scores often above 50%, highlighting both the robustness of instruction following and the need for new interpretability tools. The findings challenge the traditional linear representation hypothesis and have implications for robust alignment and mechanistic interpretability.
Abstract
Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.
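To make the notion of a non-linear (superadditive) causal interaction concrete, the following is a minimal toy sketch, not the authors' method or models. It assumes a hypothetical two-input readout and a hypothetical `patch_effect` helper that restores clean values into a corrupted input. If effects were additive, patching both inputs together would equal the sum of patching each alone; the gap between the two is the interaction term.

```python
from typing import Callable, Sequence, Tuple


def linear_readout(a: float, b: float) -> float:
    """Hypothetical additive readout: each input contributes independently."""
    return 0.5 * a + 0.5 * b


def gated_readout(a: float, b: float) -> float:
    """Hypothetical nonlinear readout: a ReLU that fires only if both inputs are present."""
    return max(0.0, a + b - 1.0)


def patch_effect(
    f: Callable[[float, float], float],
    clean: Tuple[float, float],
    corrupt: Tuple[float, float],
    patched_idx: Sequence[int],
) -> float:
    """Change in output when clean values are restored at `patched_idx` of a corrupted input."""
    x = list(corrupt)
    for i in patched_idx:
        x[i] = clean[i]
    return f(*x) - f(*corrupt)


def interaction(f, clean, corrupt) -> float:
    """Joint patching effect minus the sum of single-component effects."""
    joint = patch_effect(f, clean, corrupt, [0, 1])
    single = patch_effect(f, clean, corrupt, [0]) + patch_effect(f, clean, corrupt, [1])
    return joint - single


clean, corrupt = (1.0, 1.0), (0.0, 0.0)

# Additive case: the single-component effects sum to the joint effect.
linear_gap = interaction(linear_readout, clean, corrupt)

# Superadditive case: each component alone has no effect, but together they do.
gated_gap = interaction(gated_readout, clean, corrupt)

print(f"linear interaction: {linear_gap}, gated interaction: {gated_gap}")
```

In this toy setting, single-component patching attributes zero effect to either input of the gated readout even though both are jointly necessary, which is the kind of blind spot that an additive causal graph cannot represent.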
