Where does an LLM begin computing an instruction?
Aditya Pola, Vineeth N. Balasubramanian
TL;DR
This work asks where instruction following begins within transformer models. It introduces three simple datasets (Key-Value, Quote Attribution, Letter Selection) plus two-hop compositions of them, and applies activation patching to minimal-contrast prompt pairs to locate an onset layer across models in the Llama family. The core finding is an inflection point in depth: patching selected residual activations before onset changes the predicted answer, while the same intervention after onset is largely ineffective, and two-hop compositions show a similar onset location. Onset thus gives a simple, replicable coordinate for comparing where instruction following begins across tasks and model sizes.
Abstract
Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins: the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset: interventions that change predictions before this point become largely ineffective after it. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
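
To make the layer-wise flip rate concrete, the following is a minimal sketch on a toy stand-in for a model, not the authors' code or a real Llama forward pass. Everything named here is an assumption made for illustration: the `run` function, the `ONSET` constant, the word/index pairs loosely modeled on Letter Selection, and the choice to count a flip whenever the patched prediction differs from the unpatched one. The toy is hard-wired to read its instruction at one layer, so it only shows how the measurement is computed, not why a real model would behave this way.

```python
# Toy illustration of activation patching and a layer-wise flip rate.
# All names and constants below are hypothetical, chosen for this sketch.

N_LAYERS = 6
ONSET = 3  # assumed layer at which the toy model reads its instruction


def run(word, index, patch_layer=None, patched_instr=None):
    """Toy forward pass for a letter-selection task.

    The instruction (an index into `word`) sits in an "instruction position"
    activation. At layer ONSET the answer position reads it and the answer
    is fixed. If `patch_layer` is given, the instruction activation is
    overwritten with `patched_instr` at the input of that layer.
    Returns the predicted letter and the per-layer instruction activations.
    """
    instr = index
    answer = None
    cache = []
    for layer in range(N_LAYERS):
        if layer == patch_layer:
            instr = patched_instr  # the intervention
        cache.append(instr)
        if layer == ONSET:
            answer = word[instr]  # instruction is executed here
    return answer, cache


def layerwise_flip_rate(pairs):
    """For each layer, the fraction of minimal-contrast pairs whose
    prediction changes when that layer's instruction activation is
    replaced by the one cached from the contrast prompt."""
    rates = []
    for layer in range(N_LAYERS):
        flips = 0
        for word, base_idx, contrast_idx in pairs:
            base_answer, _ = run(word, base_idx)
            _, contrast_cache = run(word, contrast_idx)
            patched_answer, _ = run(word, base_idx, layer, contrast_cache[layer])
            flips += patched_answer != base_answer
        rates.append(flips / len(pairs))
    return rates


# Minimal-contrast pairs: same content, only the instruction (index) differs.
PAIRS = [("onset", 0, 2), ("patch", 1, 3), ("layer", 4, 0)]

if __name__ == "__main__":
    for layer, rate in enumerate(layerwise_flip_rate(PAIRS)):
        print(f"layer {layer}: flip rate {rate:.2f}")
```

In this toy, patching at or before the assumed read layer flips every prediction and patching after it flips none, which is the kind of sharp drop the abstract calls an inflection point. With a real model, the same loop would cache and substitute residual-stream activations at selected token positions, and the drop would be expected to be less abrupt.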
