Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

Jack Merullo, Carsten Eickhoff, Ellie Pavlick

TL;DR

The paper investigates why transformer language models exhibit prompt sensitivity and arbitrary item-order effects by uncovering low-rank communication channels that transfer information across layers. Using SVD to decompose OV and QK matrices, the authors identify interpretable, causally significant channels, demonstrate their role via IOI and Laundry List tasks, and show that targeted weight edits and subspace interventions can substantially alter behavior. The work provides evidence that intricate, content-independent structures learned during pretraining govern cross-layer information flow and offers methods for circuit discovery and steerable model editing. These findings advance mechanistic understanding of LMs and point toward practical interpretability and control strategies for complex language tasks.

Abstract

Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task, and find that it underlies a commonly used abstraction across many context-retrieval behaviors. Specifically, we find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers, forming low-rank communication channels (Elhage et al., 2021) between layers. A particular 3D subspace in model activations in GPT-2 can be traversed to positionally index items in lists, and we show that this mechanism can explain an otherwise arbitrary-seeming sensitivity of the model to the order of items in the prompt. That is, the model has trouble copying the correct information from context when many items "crowd" this limited space. By decomposing attention heads with the Singular Value Decomposition (SVD), we find that previously described interactions between heads separated by one or more layers can be predicted via analysis of their weight matrices alone. We show that it is possible to manipulate the internal model representations as well as edit model weights based on the mechanism we discover in order to significantly improve performance on our synthetic Laundry List task, which requires recall from a list, often improving task accuracy by over 20%. Our analysis reveals a surprisingly intricate interpretable structure learned from language model pretraining, and helps us understand why sophisticated LMs sometimes fail in simple domains, facilitating future analysis of more complex behaviors.
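The SVD decomposition mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the weights are random stand-ins, the sizes `d_model` and `d_head` are hypothetical, and the helper name `rank1_components` is invented for illustration. It only shows how a head's combined OV matrix splits into rank-1 component matrices that sum back to the original.

```python
import numpy as np


def rank1_components(w_ov: np.ndarray, k: int) -> list[np.ndarray]:
    """Return the top-k rank-1 matrices sigma_i * u_i v_i^T of w_ov."""
    u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
    return [s[i] * np.outer(u[:, i], vt[i]) for i in range(k)]


rng = np.random.default_rng(0)
d_model, d_head = 64, 8  # hypothetical sizes, far smaller than a real model

# Random stand-ins for a head's value and output weights (assumption).
w_v = rng.normal(size=(d_model, d_head))
w_o = rng.normal(size=(d_head, d_model))
w_ov = w_v @ w_o  # (d_model, d_model), rank at most d_head

# Because rank(w_ov) <= d_head, the top d_head components capture everything.
components = rank1_components(w_ov, d_head)
reconstruction = sum(components)

print(np.allclose(reconstruction, w_ov))
print([int(np.linalg.matrix_rank(c)) for c in components])
```

Each component writes along a single output direction, which is what makes it possible to study one component of a head at a time.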

Paper Structure (43 sections, 2 equations, 36 figures, 1 table)

Figures (36)

  • Figure 1: Language models are often sensitive to arbitrary changes in a prompt, for example the order in which objects are listed (right). This problem is more pronounced as the number of objects increases (left) even though it is not obvious where the issue stems from in the model. We broadly explore how information is routed through a model and focus on a mechanism that is in part responsible for this (in)ability.
  • Figure 2: The relationship between the composition score (weight-based, bottom) and the inhibition score (data-based, top) for various inhibition head components and mover head 9.9 on the IOI task. The inhibition of each inhibition head is generally concentrated in one or two components of the matrix; removing such a component causes a large drop in the later mover head's ability to downweight one of the names. This shows that the composition score can be used when considering decomposed matrices.
  • Figure 3: Because component matrices are rank-1, their output spaces are 1D and interpreting them becomes easier. On the left, inhibition component activations fall on either side of the origin, and selectively inhibit the name in either position one or position two in the IOI task. We can scale a vector lying on this line by some scalar alpha and observe how this changes behavior when we add it to the residual stream, or replace the output of an attention head with it (right), which we show in Figure \ref{fig:ioi_intervs}.
  • Figure 4: We find that the 1D inhibition components and 2D duplicate token components finely control which name is avoided by the mover head. On the top, we can selectively inhibit either the first or second name depending on how we scale a vector lying on the 8.10.1 output space. This strictly controls relative position. On the bottom, we find that adding or removing duplicate token information from the duplicate channel at the IO or S1 tokens also effectively modulates which name is inhibited. Neither random heads nor non-communication-channel components exhibit these effects (right). See Appendix \ref{sec:more_intervs} for results on other heads.
  • Figure 5: Scaling the inhibition component for a single head (here 8.10, left) is not expressive enough to get the mover head to index between the various objects. Scaling the top three inhibition components (middle) gives us enough expressive power to selectively attend to one of the objects. Here, one dot represents a run on the corresponding dataset and the color represents the index of the object the mover head pays the most attention to on average. A surprising structure emerges that partitions the space according to the index of the objects. However, this neat structure begins to break down once the number of objects reaches around 10 or more, which affects the mover head's ability to attend to the right object and in turn impacts accuracy. Right: Accuracy improvements as a result of sampling from inhibition space. The model becomes much more capable of handling a larger number of objects, in that the accuracy for N objects after the intervention is about as high as that of the unaltered model when it sees N/2 objects. However, the representational power of the inhibition channel reaches capacity as the number of objects increases, and performance cannot improve as much.
  • ...and 31 more figures
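
The captions for Figures 2 and 3 describe two operations: scoring how strongly a rank-1 component of an earlier head composes with a later head, and scaling a vector on a component's 1D output space by alpha before adding it to the residual stream. The sketch below illustrates both under stated assumptions. It uses the Frobenius-norm composition score as commonly attributed to Elhage et al. (2021), a row-vector convention (activations multiply weights from the left), random stand-in weights, and hypothetical helper names `composition_score` and `steer`. It is not the paper's implementation.

```python
import numpy as np


def composition_score(w_write: np.ndarray, w_read: np.ndarray) -> float:
    """Frobenius-norm composition score; lies in [0, 1]."""
    num = np.linalg.norm(w_write @ w_read)
    return float(num / (np.linalg.norm(w_write) * np.linalg.norm(w_read)))


def steer(resid: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha times a unit-normalized direction to a residual vector."""
    unit = direction / np.linalg.norm(direction)
    return resid + alpha * unit


rng = np.random.default_rng(0)
d_model, d_head = 64, 8  # hypothetical sizes

# Stand-in for a later head's reading weights, e.g. its query matrix.
w_read = rng.normal(size=(d_model, d_head))
u_full, _, _ = np.linalg.svd(w_read, full_matrices=True)

u_in = rng.normal(size=d_model)  # arbitrary input direction of a component
v_aligned = u_full[:, 0]         # output direction the reader is most sensitive to
v_orthogonal = u_full[:, -1]     # output direction the reader cannot see

# Rank-1 components: x @ outer(u, v) = (x . u) * v, so v is the output direction.
aligned_component = np.outer(u_in, v_aligned)
orthogonal_component = np.outer(u_in, v_orthogonal)

aligned_score = composition_score(aligned_component, w_read)
orthogonal_score = composition_score(orthogonal_component, w_read)
print(aligned_score > orthogonal_score)

# Move a residual-stream vector along the component's 1D output space.
resid = rng.normal(size=d_model)
steered = steer(resid, v_aligned, alpha=3.0)
```

In this toy setup, a component that writes into a direction the later head reads from gets a high score, while one that writes into a direction the head ignores scores near zero. This mirrors, in a simplified way, the idea of predicting head interactions from weights alone.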