Circuit Component Reuse Across Tasks in Transformer Language Models

Jack Merullo, Carsten Eickhoff, Ellie Pavlick

TL;DR

The paper presents evidence that the attention-head circuit found for the Indirect Object Identification (IOI) task in transformer language models is largely reused for a seemingly different task, Colored Objects, with about 78% overlap in in-circuit heads. Using path patching and causal interventions, the authors reproduce the IOI circuit on GPT2-Medium and show that both tasks rely on a largely shared set of task-general algorithmic building blocks. A targeted intervention on four middle-layer attention heads "repairs" the Colored Objects circuit, raising accuracy from 49.6% to 93.7% and changing downstream head behavior in the ways the IOI circuit predicts, which indicates that the subcircuit behaves the same way across the two task inputs. Together, these results support the view that large language model behavior may be explainable in terms of a relatively small number of interpretable, reusable components.

Abstract

Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and 1.) show that it reproduces on a larger GPT2 model, and 2.) that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the process underlying both tasks is functionally very similar, and contains about a 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components.
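
To make the "about 78% overlap in in-circuit attention heads" figure concrete, the sketch below shows two common ways to measure overlap between two sets of heads. It is an illustration only: this summary does not say which definition the authors use, so both `directional_overlap` and `jaccard_overlap` are assumptions, the function names are hypothetical, and the head sets are made up rather than taken from the paper.

```python
from typing import Set, Tuple

Head = Tuple[int, int]  # (layer, head index within the layer)


def directional_overlap(circuit_a: Set[Head], circuit_b: Set[Head]) -> float:
    """Fraction of the heads in circuit_a that also appear in circuit_b."""
    if not circuit_a:
        raise ValueError("circuit_a must contain at least one head")
    return len(circuit_a & circuit_b) / len(circuit_a)


def jaccard_overlap(circuit_a: Set[Head], circuit_b: Set[Head]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = circuit_a | circuit_b
    if not union:
        raise ValueError("at least one circuit must contain a head")
    return len(circuit_a & circuit_b) / len(union)


# Made-up head sets for illustration only; NOT the heads reported in the paper.
toy_ioi = {(1, 0), (2, 3), (3, 1), (4, 2)}
toy_colored_objects = {(2, 3), (3, 1), (4, 2), (5, 5)}

print(directional_overlap(toy_colored_objects, toy_ioi))  # 0.75
print(jaccard_overlap(toy_colored_objects, toy_ioi))      # 0.6
```

The two measures can differ noticeably on the same pair of circuits, so a reported overlap percentage is only interpretable alongside its definition.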

Paper Structure

This paper contains 51 sections and 25 figures.

Figures (25)

  • Figure 1: An example from the modified Colored Objects task. All inputs are one shot, where the first example is the In Context (IC) example, and the second is the Test example. The goal of the task is to predict the correct color (denoted by the 'X') and ignore the other color options (in particular, the other test example colors, denoted by a square).
  • Figure 2: Average attention paid to tokens in the input by content gatherer heads. Most attention is paid to the obj$_2$ token and the word ' color'.
  • Figure 3: Left: The full graph of attention heads with the difference of the path patching importance scores between each task (normalized by task per each path patching iteration). Middle: Visualizing only the union of the top 2% most important heads (per path patching stage) for each task, colored by the difference in importance scores. Right: Explanation of each stage of processing in the circuit. Both circuits involve the same general process: detecting a duplication and using that duplication to decide which token to copy. In the Colored Objects task, the duplication is used as a "positive" signal via the content gatherer heads to tell the mover heads which token to copy, while in IOI the duplication sends a "negative" signal via the inhibition heads to tell the mover heads which tokens to ignore. These heads, together with the activation of the negative mover head in IOI, constitute the only major difference between the two tasks.
  • Figure 4: Analyzing the activity of one of the inhibition heads (12.3) on the Colored Objects task shows that it attends strongly to test color words and the in-context label (and writes in the opposite direction in embedding space when attention is high), although the inhibition heads do not affect the mover heads as they do in IOI. Scatter plots for the other two inhibition heads are shown in Appendix \ref{sec:cobjs_inhibition}. Colors indicate the color of the word being attended to. See Appendix \ref{sec:glossary} for explanations of the axes.
  • Figure 5: Intervening on the attention patterns of the inhibition heads and the negative mover head increases accuracy on the full dataset from 49.6% to 93.7%. Furthermore, the interventions (specifically on the inhibition heads) affect the mover heads in the ways predicted by the IOI circuit. The right two graphs compare the logit difference and the attention to wrong colors before and after the intervention; results for these two are taken only over the 496 examples GPT2-Medium originally gets right, for a fair comparison. Together, this evidence suggests that the inhibition-mover subcircuit is itself a manipulable structure within the model that is invariant to the highly different input domains used in the experiments. Error bars show standard error.
  • ...and 20 more figures
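
Two quantities named in the captions above can be sketched in a few lines: the union of the top 2% most important heads per task (Figure 3) and a logit difference summarized with its standard error (Figure 5). This is a minimal illustration under stated assumptions, not the authors' code. The function names are hypothetical, ranking heads by absolute score and rounding the cutoff are assumptions, and defining the logit difference as the correct-color logit minus the largest wrong-color logit is likewise an assumption, since the captions do not give the exact formula.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Sequence, Set, Tuple

Head = Tuple[int, int]  # (layer, head index within the layer)


def top_fraction(scores: Dict[Head, float], fraction: float = 0.02) -> Set[Head]:
    """Heads with the largest absolute importance scores (assumed ranking rule)."""
    k = max(1, round(len(scores) * fraction))  # rounding rule is an assumption
    ranked = sorted(scores, key=lambda head: abs(scores[head]), reverse=True)
    return set(ranked[:k])


def top_union(scores_a: Dict[Head, float], scores_b: Dict[Head, float],
              fraction: float = 0.02) -> Set[Head]:
    """Union of the top-scoring heads selected separately for each task."""
    return top_fraction(scores_a, fraction) | top_fraction(scores_b, fraction)


def logit_difference(correct_logit: float, wrong_logits: Sequence[float]) -> float:
    """Correct-answer logit minus the largest wrong-answer logit (assumed form)."""
    return correct_logit - max(wrong_logits)


def mean_and_standard_error(values: List[float]) -> Tuple[float, float]:
    """Sample mean and standard error (sample std / sqrt(n)); needs n >= 2."""
    return mean(values), stdev(values) / math.sqrt(len(values))
```

With 24 layers of 16 heads each, GPT2-Medium has 384 attention heads, so a 2% cutoff under this rounding rule would keep 8 heads per task before taking the union.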