An explainable transformer circuit for compositional generalization
Cheng Tang, Brenden Lake, Mehrdad Jazayeri
TL;DR
The paper tackles compositional generalization in transformers by uncovering a minimal, mechanistic QK circuit responsible for compositional induction in a compact encoder–decoder model. Through causal ablations, logit attribution, and path-patching, the authors identify a dominant Output Head and two linked circuits (K- and Q-circuits) that encode index-in-question and relative-index-on-LHS, respectively, enabling a program-like description of the algorithm. They demonstrate that precise activation edits, such as swapping positional indices, can steer the model's predictions in predictable ways, providing a direct pathway for model control. The work advances mechanistic interpretability by showing how complex compositional behavior can be decomposed into interpretable subcircuits and suggests directions for automated circuit-discovery methods in larger models.
Abstract
Compositional generalization, the systematic combination of known components into novel structures, remains a core challenge in cognitive science and machine learning. Although transformer-based large language models can exhibit strong performance on certain compositional tasks, the underlying mechanisms driving these abilities remain opaque, calling into question their interpretability. In this work, we identify and mechanistically interpret the circuit responsible for compositional induction in a compact transformer. Using causal ablations, we validate the circuit and formalize its operation using a program-like description. We further demonstrate that this mechanistic understanding enables precise activation edits to steer the model's behavior predictably. Our findings advance the understanding of complex behaviors in transformers and highlight that such insights can provide a direct pathway for model control.
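To make the idea of an activation edit concrete, the following is a minimal toy sketch, not the authors' model or code. It assumes a single attention head whose query-key score depends only on a one-hot position-index code, and which copies the token at the most-attended position. All names (`one_hot`, `attend`, `tokens`, the `scale` parameter) are hypothetical and chosen for illustration. Swapping the index codes carried by two key positions changes which token the head copies, which is the kind of predictable steering the abstract describes.

```python
import math

def one_hot(i, n):
    """Hypothetical index code: a one-hot vector of length n."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, scale=10.0):
    """Toy QK attention: score each key by its dot product with the query,
    then copy the value at the most-attended position."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return values[best], weights

# Assumed setup: three source tokens, each key carrying its own index code.
tokens = ["red", "green", "blue"]
n = len(tokens)
keys = [one_hot(i, n) for i in range(n)]

# The query asks for the token at index 1.
query = one_hot(1, n)
baseline, _ = attend(query, keys, tokens)

# Activation edit: swap the index codes stored at key positions 1 and 2.
edited_keys = list(keys)
edited_keys[1], edited_keys[2] = edited_keys[2], edited_keys[1]
steered, _ = attend(query, edited_keys, tokens)

print(baseline, steered)
```

In this sketch the edit touches only the index codes in the keys, leaving the tokens themselves unchanged, so any change in the output is attributable to the positional information the head reads. A real circuit would involve learned, distributed index representations rather than one-hot codes.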
