Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick
TL;DR
The paper investigates why transformer language models exhibit prompt sensitivity and arbitrary item-order effects by uncovering low-rank communication channels that transfer information across layers. Using SVD to decompose OV and QK matrices, the authors identify interpretable, causally significant channels, demonstrate their role on the Indirect Object Identification (IOI) and Laundry List tasks, and show that targeted weight edits and subspace interventions can substantially alter behavior. The work provides evidence that intricate, content-independent structures learned during pretraining govern cross-layer information flow, and it offers methods for circuit discovery and steerable model editing. These findings advance mechanistic understanding of LMs and point toward practical interpretability and control strategies for complex language tasks.
Abstract
Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task, and find that it underlies a commonly used abstraction across many context-retrieval behaviors. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers, forming low-rank communication channels (Elhage et al., 2021) between layers. A particular 3D subspace in model activations in GPT-2 can be traversed to positionally index items in lists, and we show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. That is, the model has trouble copying the correct information from context when many items "crowd" this limited space. By decomposing attention heads with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices alone. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.
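The weight-only analysis described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code: the matrices are random toy weights with one planted shared direction, the row-vector convention (activations multiply weights from the left) and the query-side pairing of an earlier head's OV matrix with a later head's QK matrix are assumptions, and the Frobenius-norm composition score follows the general form used by Elhage et al. (2021). The names `composition_score`, `svd_components`, and `shared` are hypothetical.

```python
import numpy as np

def composition_score(writer, reader):
    """Frobenius-norm composition score between two d_model x d_model maps."""
    return np.linalg.norm(writer @ reader) / (
        np.linalg.norm(writer) * np.linalg.norm(reader)
    )

def svd_components(matrix, k):
    """Return the top-k rank-1 terms s_i * u_i v_i^T of the SVD of `matrix`."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return [S[i] * np.outer(U[:, i], Vt[i]) for i in range(k)]

rng = np.random.default_rng(0)
d_model, d_head = 64, 8  # toy sizes, chosen only for illustration

# Assumption: a single residual-stream direction acts as the "channel".
shared = rng.normal(size=d_model)
shared /= np.linalg.norm(shared)

# Earlier head: its output projection writes strongly along `shared`.
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_O[0] = 50.0 * shared

# Later head: its query projection reads strongly from `shared`.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_Q[:, 0] = 50.0 * shared

OV = W_V @ W_O      # what the earlier head writes to the residual stream
QK = W_Q @ W_K.T    # bilinear form the later head uses to score attention

# Score each rank-1 SVD component of OV against the later head's QK matrix.
components = svd_components(OV, d_head)
scores = [composition_score(c, QK) for c in components]

# Weight edit: remove the top component from OV and re-score the whole head.
full_score = composition_score(OV, QK)
edited_score = composition_score(OV - components[0], QK)

print("per-component scores:", np.round(scores, 3))
print("full head:", round(full_score, 3), "after edit:", round(edited_score, 3))
```

In this toy setup the component aligned with the planted direction carries nearly all of the composition signal, and subtracting it from the OV matrix sharply reduces the head-to-head score. That is the sense in which a low-rank channel can be located, and then edited, from the weights alone.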
