Where does an LLM begin computing an instruction?

Aditya Pola, Vineeth N. Balasubramanian

TL;DR

This work locates where instruction following begins within transformer models. It introduces three small, well-controlled datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of them, and applies activation patching to identify an onset layer across Llama models. The core finding is a distinct onset depth: before onset, substituting instruction-related activations often changes the answer; after onset, the same substitution rarely does, and two-hop tasks show a similar onset location. The onset provides a reproducible coordinate for comparing instruction-following behavior across tasks and model sizes and for diagnosing model dynamics in mechanistic interpretability work.

Abstract

Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins: the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate: the fraction of items whose predicted answer changes when selected residual activations are substituted at a given layer. Across models in the Llama family, we observe an inflection point, which we term onset: interventions that change predictions before this point become largely ineffective after it. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
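
The layer-wise flip rate described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the patched forward passes have already been run, that a "flip" means the patched answer differs from the unpatched answer for the same item, and the function name `layerwise_flip_rate` and the toy answers are hypothetical.

```python
from typing import List, Sequence


def layerwise_flip_rate(
    clean_answers: Sequence[str],
    patched_answers: Sequence[Sequence[str]],
) -> List[float]:
    """Fraction of items whose answer changes after patching, per layer.

    clean_answers[i]      : answer for item i without any intervention.
    patched_answers[l][i] : answer for item i after patching at layer l.
    Assumption: a flip is any patched answer that differs from the clean one.
    """
    n_items = len(clean_answers)
    if n_items == 0:
        raise ValueError("need at least one item")

    rates = []
    for layer_answers in patched_answers:
        if len(layer_answers) != n_items:
            raise ValueError("every layer must cover every item")
        # Count the items whose answer switched under the intervention.
        flips = sum(p != c for p, c in zip(layer_answers, clean_answers))
        rates.append(flips / n_items)
    return rates


# Toy values for illustration only; these are not results from the paper.
clean = ["mango", "iron", "mango", "iron"]
patched = [
    ["iron", "mango", "iron", "mango"],  # early layer: every answer flips
    ["iron", "mango", "mango", "iron"],  # middle layer: half the answers flip
    ["mango", "iron", "mango", "iron"],  # late layer: nothing flips
]
rates = layerwise_flip_rate(clean, patched)
```

A stricter variant would count a flip only when the patched answer equals the answer implied by the substituted instruction (iron in the Figure 1 example); which definition the paper uses is not stated in this summary.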

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: Overview: timing of instruction replacement determines the outcome. In a key–value prompt (CONTENT $\rightarrow$ fruit:mango, metal:iron; INSTRUCTION KEYWORD $\rightarrow$ fruit?), the instruction box shows unread/read icons to indicate whether the model has already read the instruction at the time of the intervention. The intervention is shown as the text “replace with metal.” In the upper path, the replacement appears after the instruction is read, and the answer remains mango. In the lower path, the replacement appears before it is read, and the answer flips to iron. Timing the replacement reveals when instruction following begins.
  • Figure 2: Single-hop keyword-span causal patching. Panels plot layer-wise flip rate—the fraction of items whose answer switches when substituting residuals at the keyword tokens—against transformer layer; the vertical orange bar marks the changepoint (instruction onset). Across Key–Value, Letter Selection, and Quote Attribution, and across Llama3.2-1B/3B and Llama3.1-8B, flip rate is high from early layers up to onset and then approaches baseline. This indicates a depth-localized boundary: after onset, replacing keyword residuals rarely affects the answer, whereas before onset it often does. Onset depth varies across tasks and models, but the pre/post contrast is consistent.
  • Figure 3: Multi-hop keyword-span causal patching. Layer-wise flip rate under residual-stream patching of the keyword span for two-hop compositions; the vertical orange bar marks the changepoint (instruction onset). Across models and hop orderings, curves remain elevated up to onset and then taper toward baseline, paralleling single-hop behavior. The onset depth matches the single-hop depth (no systematic deeper shift), indicating that added procedural complexity leaves instruction localization unchanged and that post-onset computation proceeds without further reliance on the keyword tokens.
  • Figure 4: Answer-bottleneck flip-rate heatmap (Llama3.1-8B-Instruct). Heatmap of layer-wise flip rate when patching the final context token; red vertical lines mark the changepoint. Flip rates rise after onset and stay below keyword-span levels, indicating late consolidation of the instruction signal onto the last context token with weaker causal leverage than the original keyword tokens.
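
The captions above mark onset as a changepoint in the layer-wise flip-rate curve. One simple way to place such a changepoint is a two-segment, piecewise-constant fit, sketched below. This is an assumption for illustration: the summary does not say which changepoint method the paper uses, and the function name `estimate_onset` and the toy curve are hypothetical.

```python
from typing import Sequence


def estimate_onset(flip_rates: Sequence[float]) -> int:
    """Return the index of the first layer in the post-onset segment.

    Assumption: onset is the split point that minimizes the total squared
    error when the curve is modeled as two constant segments (high flip
    rate before the split, low flip rate after it).
    """
    n_layers = len(flip_rates)
    if n_layers < 2:
        raise ValueError("need at least two layers to place a changepoint")

    def squared_error(segment: Sequence[float]) -> float:
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_split, best_cost = 1, float("inf")
    # Try every split that leaves at least one layer on each side.
    for split in range(1, n_layers):
        cost = squared_error(flip_rates[:split]) + squared_error(flip_rates[split:])
        if cost < best_cost:
            best_split, best_cost = split, cost
    return best_split


# Toy curve for illustration only; these are not values from the paper.
toy_curve = [0.90, 0.85, 0.90, 0.80, 0.10, 0.05, 0.00, 0.05]
onset_layer = estimate_onset(toy_curve)
```

With the toy curve, the split falls where the flip rate drops from about 0.8 to about 0.1, which corresponds to the position the vertical bar would mark in a plot like Figure 2.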