Patches of Nonlinearity: Instruction Vectors in Large Language Models

Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych

TL;DR

This work investigates how instructions are represented inside instruction-tuned LLMs by locating Instruction Vectors in the residual stream after the final instructional token $T_{\text{inst}}$. It finds that IVs are localized, linearly separable by task semantics, yet engage in nonlinear, superadditive interactions that cannot be captured by additive causal graphs. To study this, the authors develop an intervention-free, locally-linear surrogate framework for tracing information flow in transformers and demonstrate that IVs function as circuit selectors that guide distinct information pathways in later layers. They validate across base, SFT, and DPO variants on a suite of eight tasks including BigBench, with IA scores often above 50%, highlighting both the robustness of instruction following and the need for new interpretability tools. The findings challenge the traditional linear representation hypothesis and have implications for robust alignment and mechanistic interpretability.

Abstract

Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.

Paper Structure (18 sections, 9 equations, 13 figures, 15 tables)

Figures (13)

  • Figure 1: We locate instruction vectors (IVs) in the residual stream representation of the final instructional token, $T_{\text{inst}}$. Across models and tasks, we find that $T_{\text{inst}}$ stores sufficient instruction information, and that layerwise representations are more effective in combination (1) than alone (2), i.e., IVs are superadditive (§ \ref{sec:identifying_ivs}).
  • Figure 2: Effects of 1- and 2-layer patching configurations on the reciprocal rank of the target token. Each square of the x- and y-coordinate grid represents the corresponding layers of the model being patched. Coordinates where x=y represent 1-layer patching. Two important properties are shown. Localization: Across all tasks, we observe localized points where the logit improvement is the greatest. Superadditivity: Two-layer combinations bring greater improvements than single layers. Generalization to 7B models and other tasks is shown in the appendix.
  • Figure 3: Conceptual decomposition of the Transformer as a collection of locally-linear, token-to-token maps, indicating how information flows through the model. For each layer and token position, high-ranking paths (in color) to the output token may exist. Other paths (in gray) may be low-ranking, or may not lead to the target token.
  • Figure 4: Path contribution by token position for 1B models. For each task, we examine a subset of token positions in the prompt, and for each token position, we average the number of high-ranking paths over 100 task samples. The $T_{\text{inst}}$ tokens for each prompt are highlighted in blue.
  • Figure 5: Attention head activity for OLMo2-1B across the contrastive tasks. Each value shows how often the head at a given layer and head index is active (% of 100 samples). We highlight the areas where head activity is similar (blue) and where it diverges (purple). Note the similarity between the $T_{\text{inst}}$ tokens (subplots adj: comp, adj: ant, anim: color) and an earlier token of anim: can_fly, which has a multi-sentence instruction (see Section \ref{sec:path_experiments}).
  • ...and 8 more figures
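
The Figure 2 caption measures patching effects by the reciprocal rank of the target token and describes two-layer patches as superadditive. The following is a minimal sketch of how such a check could look, not the authors' implementation. It assumes one common reading of superadditivity: the gain from patching two layers jointly exceeds the sum of the two single-layer gains. The function names and the toy logits are hypothetical, and the numbers are made up for illustration rather than taken from the paper.

```python
def reciprocal_rank(logits, target_id):
    """Return 1 / rank of the target token, where rank 1 is the highest logit."""
    target_logit = logits[target_id]
    # Rank = 1 + number of tokens scoring strictly higher than the target.
    rank = 1 + sum(1 for value in logits if value > target_logit)
    return 1.0 / rank


def is_superadditive(rr_base, rr_i, rr_j, rr_ij):
    """True if the joint gain exceeds the sum of the single-layer gains.

    Assumed criterion (not confirmed by the source):
    (rr_ij - rr_base) > (rr_i - rr_base) + (rr_j - rr_base)
    """
    gain_i = rr_i - rr_base
    gain_j = rr_j - rr_base
    gain_ij = rr_ij - rr_base
    return gain_ij > gain_i + gain_j


# Toy logits over a 5-token vocabulary (illustrative values only).
target = 2
base = [3.0, 2.5, 0.1, 2.0, 1.0]      # no patching
patch_i = [3.0, 2.5, 1.5, 2.0, 1.0]   # hypothetical patch at layer i
patch_j = [3.0, 2.5, 1.2, 2.0, 1.0]   # hypothetical patch at layer j
patch_ij = [3.0, 2.5, 3.5, 2.0, 1.0]  # hypothetical joint patch at layers i and j

rr = [reciprocal_rank(x, target) for x in (base, patch_i, patch_j, patch_ij)]
superadditive = is_superadditive(*rr)
print(rr, superadditive)
```

In the grid of Figure 2, the diagonal cells (x=y) would correspond to single-layer values such as `rr[1]` and `rr[2]`, and off-diagonal cells to joint values such as `rr[3]`.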