An explainable transformer circuit for compositional generalization

Cheng Tang, Brenden Lake, Mehrdad Jazayeri

TL;DR

The paper tackles compositional generalization in transformers by uncovering a minimal, mechanistic QK circuit responsible for compositional induction in a compact encoder–decoder model. Through causal ablations, logit attribution, and path-patching, the authors identify a dominant Output Head and two linked circuits (K- and Q-circuits) that encode index-in-question and relative-index-on-LHS, respectively, enabling a program-like description of the algorithm. They demonstrate that precise activation edits, such as swapping positional indices, can steer the model's predictions in predictable ways, providing a direct pathway for model control. The work advances mechanistic interpretability by showing how complex compositional behavior can be decomposed into interpretable subcircuits and suggests directions for automated circuit-discovery methods in larger models.

Abstract

Compositional generalization, the systematic combination of known components into novel structures, remains a core challenge in cognitive science and machine learning. Although transformer-based large language models can exhibit strong performance on certain compositional tasks, the underlying mechanisms driving these abilities remain opaque, calling into question their interpretability. In this work, we identify and mechanistically interpret the circuit responsible for compositional induction in a compact transformer. Using causal ablations, we validate the circuit and formalize its operation using a program-like description. We further demonstrate that this mechanistic understanding enables precise activation edits to steer the model's behavior predictably. Our findings advance the understanding of complex behaviors in transformers and highlight how such insights can provide a direct pathway for model control.
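The causal-ablation step mentioned in the abstract can be illustrated with a toy mean-ablation. The sketch below is not the authors' code or model: it uses synthetic per-head outputs, and every name in it (`mean_ablate_head`, `correct_logit`, `head_outputs`, `unembed`) and every dimension is a hypothetical choice made for illustration. Head 2 is deliberately constructed to carry the target information, so ablating it should produce the largest drop in the correct-token logit.

```python
import numpy as np

def mean_ablate_head(head_outputs, head):
    """Replace one head's output with its mean over the batch.

    head_outputs: (batch, n_heads, d_model) per-head contributions
    to the residual stream (a toy stand-in for a real model).
    """
    ablated = head_outputs.copy()
    ablated[:, head, :] = head_outputs[:, head, :].mean(axis=0, keepdims=True)
    return ablated

def correct_logit(head_outputs, unembed, targets):
    """Logit of the correct token when heads are summed and unembedded."""
    resid = head_outputs.sum(axis=1)   # (batch, d_model)
    logits = resid @ unembed           # (batch, vocab)
    return logits[np.arange(len(targets)), targets]

rng = np.random.default_rng(0)
batch, n_heads, d_model, vocab = 64, 4, 16, 10  # toy sizes, assumed

unembed = rng.normal(size=(d_model, vocab))
targets = rng.integers(0, vocab, size=batch)

# Synthetic heads: small noise everywhere, plus head 2 writing the
# target's unembedding direction (an assumption built into this toy).
head_outputs = 0.1 * rng.normal(size=(batch, n_heads, d_model))
head_outputs[:, 2, :] += unembed[:, targets].T

base = correct_logit(head_outputs, unembed, targets).mean()
drops = [
    base - correct_logit(mean_ablate_head(head_outputs, h), unembed, targets).mean()
    for h in range(n_heads)
]
for h, d in enumerate(drops):
    print(f"head {h}: drop in correct-token logit = {d:.3f}")
```

Mean-ablation is used here rather than zero-ablation so that the ablated head still contributes its average output; which ablation variant the paper uses is not stated in this summary.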

Paper Structure

This paper contains 38 sections, 1 equation, 11 figures, 1 algorithm.

Figures (11)

  • Figure 1: (a) Schematic of the transformer model and task. (b) The prompt and output format for the compositional induction task.
  • Figure 2: Summary of circuit for compositional generalization. Top, the example episode's input and output. For a-e, the yellow boxes indicate self-attention heads and the blue boxes indicate cross-attention heads. Titles refer to the functional attention heads that execute the steps (discussed in detail later). We unfold all relevant information superimposed in tokens' embeddings and highlight their roles in attention operations. $[1]^*$, the $QK$ alignment discussed in Primitive-Pairing Head section. $[2]^*$, the $QK$ alignment discussed in Primitive-Retrieval Head section.
  • Figure 3: (a) Logit contributions of each decoder head to the logits of correct tokens. (b) Attention pattern of Dec-cross-1.5. (c) For Dec-cross-1.5, the percentage of attention focused on the next predicted token. (d) For Dec-cross-1.5, alignment (inner product) between its $OV$ output (e.g., $x_{red}W_vW_o$) and the corresponding unembedding vector (e.g., $\mathrm{Unemb}_{red}$). We estimated the null distribution by randomly sampling unembedding vectors.
  • Figure 4: Enc-self-1.1 and Enc-self-0.5 serve as the main contributors to the $K$-circuit for the Output Head. The $K$-circuit encodes primitive symbols' index-in-question.
  • Figure 5: (a) Top, contributions to Output Head’s performance (percentage of attention on the correct next token) via $K$. Bottom, attention pattern of Enc-self-1.1. (b) Top, contributions to the Output Head’s performance through the Primitive-Pairing Head's $V$. Bottom, attention pattern of Enc-self-0.5.
  • ...and 6 more figures
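
Figure 3(d) compares the inner product between a head's $OV$ output (e.g., $x_{red}W_vW_o$) and the matching unembedding vector against a null distribution. A minimal NumPy sketch of that comparison follows, on synthetic weights rather than the paper's model. All names (`E`, `W_v`, `W_o`, `unemb`, `ov_alignment`, `null_alignments`) and all dimensions are hypothetical, `W_o` is constructed so that the toy head copies perfectly, and the null samples unembedding vectors of other tokens, which is only one plausible reading of "randomly sampling unembedding vectors".

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_head = 6, 64, 8  # toy sizes, not the paper's

E = rng.normal(size=(vocab, d_model))      # hypothetical token embeddings
unemb = rng.normal(size=(vocab, d_model))  # hypothetical unembedding vectors
W_v = rng.normal(size=(d_model, d_head))
# Toy assumption: pick W_o so the OV map sends each embedding to its own
# unembedding vector (an exact fit is possible because vocab <= d_head).
W_o = np.linalg.pinv(E @ W_v) @ unemb

def ov_alignment(x, token):
    """Inner product between the OV output x W_v W_o and Unemb_token."""
    return float((x @ W_v @ W_o) @ unemb[token])

def null_alignments(x, token, n_samples=1000):
    """Alignment with unembedding vectors of randomly sampled other tokens."""
    others = np.delete(np.arange(vocab), token)
    idx = rng.choice(others, size=n_samples)
    return unemb[idx] @ (x @ W_v @ W_o)

token = 2
observed = ov_alignment(E[token], token)
null = null_alignments(E[token], token)
print(f"observed={observed:.2f}, null mean={null.mean():.2f}, null std={null.std():.2f}")
```

In this constructed case the observed alignment equals the squared norm of the unembedding vector, so it sits well above the null; in a trained model the gap would have to be measured rather than assumed.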