Circuit Component Reuse Across Tasks in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick
TL;DR
The paper shows that the circuit identified for the IOI task in transformer language models is largely reused for a seemingly different task (Colored Objects), with about 78% overlap in in-circuit attention heads. Using path patching and causal interventions on GPT2-Medium, the authors reproduce the IOI circuit and find that a largely shared, task-general set of algorithmic building blocks underlies both tasks. They further show that a targeted intervention on four middle-layer attention heads can repair the Colored Objects circuit, boosting accuracy from 49.6% to 93.7%. The intervention changes downstream head behavior in the ways predicted by the IOI circuit, indicating that the subcircuit behaves the same way across the two tasks' inputs. Together, these results support the view that large language models may be explainable in terms of a small number of interpretable, reusable components that compose across tasks, which would make model behavior easier to understand and potentially to control.
Abstract
Recent work in mechanistic interpretability has shown that behaviors in language models can be successfully reverse-engineered through circuit analysis. A common criticism, however, is that each circuit is task-specific, and thus such analysis cannot contribute to understanding the models at a higher level. In this work, we present evidence that insights (both low-level findings about specific heads and higher-level findings about general algorithms) can indeed generalize across tasks. Specifically, we study the circuit discovered in Wang et al. (2022) for the Indirect Object Identification (IOI) task and (1) show that it reproduces on a larger GPT2 model, and (2) show that it is mostly reused to solve a seemingly different task: Colored Objects (Ippolito & Callison-Burch, 2023). We provide evidence that the processes underlying the two tasks are functionally very similar, with about 78% overlap in in-circuit attention heads. We further present a proof-of-concept intervention experiment, in which we adjust four attention heads in middle layers in order to 'repair' the Colored Objects circuit and make it behave like the IOI circuit. In doing so, we boost accuracy from 49.6% to 93.7% on the Colored Objects task and explain most sources of error. The intervention affects downstream attention heads in specific ways predicted by their interactions in the IOI circuit, indicating that this subcircuit behavior is invariant to the different task inputs. Overall, our results provide evidence that it may yet be possible to explain large language models' behavior in terms of a relatively small number of interpretable task-general algorithmic building blocks and computational components.
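To make the overlap statistic concrete, the following is a minimal sketch of one way such a figure could be computed, assuming each circuit is represented as a set of (layer, head) pairs. The function name `head_overlap` and the head sets are hypothetical placeholders, not the heads reported in the paper, and the abstract does not state which circuit serves as the denominator, so the sketch reports both directions.

```python
from typing import Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def head_overlap(circuit_a: Set[Head], circuit_b: Set[Head]) -> float:
    """Return the fraction of heads in circuit_a that also appear in circuit_b."""
    if not circuit_a:
        raise ValueError("circuit_a must contain at least one head")
    return len(circuit_a & circuit_b) / len(circuit_a)


# Placeholder head sets, invented purely for illustration (assumption:
# these are NOT the in-circuit heads identified in the paper).
toy_circuit_a = {(0, 1), (3, 4), (5, 2), (7, 0)}
toy_circuit_b = {(0, 1), (3, 4), (5, 2), (9, 9), (10, 3)}

# The measure is asymmetric, so the choice of denominator matters.
print(f"{head_overlap(toy_circuit_a, toy_circuit_b):.0%} of A's heads are in B")
print(f"{head_overlap(toy_circuit_b, toy_circuit_a):.0%} of B's heads are in A")
```

Because the two directions can differ, an overlap percentage is easiest to interpret when the reference circuit is named alongside it.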
