Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

Rabin Adhikari

TL;DR

This work probes the mechanistic underpinnings of transformer-based reasoning by training a minimal, attention-only model on a symbolic Indirect Object Identification (IOI) task. It demonstrates that a one-layer, two-head transformer can achieve perfect IOI accuracy through a compact additive-contrastive circuit, uncovered via residual stream decomposition and spectral analysis. Extending to a two-layer, one-head setting reveals cross-layer composition: information is integrated across layers to reach similar task performance through a different architectural pathway. The findings suggest that task-specific training can yield interpretable, minimal circuits, providing a controlled testbed for studying coreference-like reasoning and offering insight into the primitive mechanisms that may underlie reasoning in larger pretrained transformers.

Abstract

Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task, a benchmark for studying coreference-like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
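To make the residual stream decomposition mentioned in the abstract concrete, the following is a minimal sketch of a one-layer, two-head attention-only forward pass with no MLPs and no normalization. Because every component writes additively into the residual stream, the final logits split exactly into a direct embedding term plus one term per head. The sizes, random weights, token ids, learned positional embeddings, and causal mask are all assumptions made for illustration; they are not taken from the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not the paper's values).
n_vocab, d_model, n_heads, d_head, seq_len = 12, 16, 2, 8, 5

# Random stand-ins for trained weights.
W_E = rng.normal(scale=0.1, size=(n_vocab, d_model))      # token embedding
W_pos = rng.normal(scale=0.1, size=(seq_len, d_model))    # positional embedding (assumed)
W_Q = rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
W_K = rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
W_V = rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
W_O = rng.normal(scale=0.1, size=(n_heads, d_head, d_model))
W_U = rng.normal(scale=0.1, size=(d_model, n_vocab))      # unembedding


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def head_outputs(tokens):
    """Return the embedding stream and each head's write to the residual stream."""
    n = len(tokens)
    x = W_E[tokens] + W_pos[:n]                           # (seq, d_model)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # causal mask (assumed)
    outs = []
    for h in range(n_heads):
        q, k, v = x @ W_Q[h], x @ W_K[h], x @ W_V[h]
        scores = q @ k.T / np.sqrt(d_head)
        attn = softmax(np.where(future, -np.inf, scores))
        outs.append(attn @ v @ W_O[h])                    # (seq, d_model)
    return x, np.stack(outs)                              # heads: (n_heads, seq, d_model)


# Hypothetical token ids standing in for one symbolic IOI prompt.
tokens = np.array([3, 4, 0, 3, 1])
x, heads = head_outputs(tokens)

# Full forward pass: residual stream = embeddings + sum of head outputs.
logits = (x + heads.sum(axis=0))[-1] @ W_U

# Decomposition at the final position: direct path plus one term per head.
direct_logits = x[-1] @ W_U
per_head_logits = heads[:, -1, :] @ W_U                   # (n_heads, n_vocab)
```

In a trained model, comparing `per_head_logits[0]` and `per_head_logits[1]` on the two candidate name tokens is one way to see which head contributes to which name, since the terms sum exactly to `logits`.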

Paper Structure

This paper contains 22 sections, 10 figures.

Figures (10)

  • Figure 1: Single-Head, One-Layer Model Fails to Learn IOI. (a) The attention heatmap shows that the <MID> token attends uniformly to the two names. (b) The QK circuit reveals that the <MID> token attends uniformly to all tokens. (c) The OV circuit shows that each name token has a large positive contribution to its own logit and a small negative contribution to the other name's logit.
  • Figure 2: Average Attention Heatmap for Two-Head, One-Layer Model. Head 0 focuses almost equally on the two name tokens from the dependent clause, while Head 1 has half of its attention on the subject of the main clause and almost a quarter on each of the names in the dependent clause.
  • Figure 3: The attention map for the second head depends on the template. While the first head always attends to the two name tokens in the dependent clause, the second head attends to the second occurrence of the subject in the main clause and the other name in the dependent clause: "BA" in the "BAAB" template and, similarly, "AB" in the "BABA" template.
  • Figure 4: Residual Stream Decomposition for Two-Head, One-Layer Model. Dot products between the output of each residual stream component and the sum and difference directions. The first head's output aligns most with the sum direction, while the second head's output aligns most with the difference direction.
  • Figure 5: Eigenvalue Distribution of QK Circuits for Two-Head, One-Layer Model. Head 1 has a larger dominant negative eigenvalue (positive fraction of $-0.65$) compared to Head 0 (positive fraction of $-0.06$), indicating a stronger suppressive effect in Head 1's attention dynamics.
  • ...and 5 more figures
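
The spectral analysis behind Figure 5 can be sketched as follows. The caption's "positive fraction" is not defined in this excerpt; the code below assumes it is a signed eigenvalue ratio, sum(λ) / sum(|λ|), which lies in [-1, 1] and is negative when negative eigenvalues dominate. The helper names `qk_circuit` and `signed_eigen_fraction`, the matrix shapes, and the random weights are hypothetical, and positional embeddings are ignored for simplicity.

```python
import numpy as np


def qk_circuit(W_E, W_Q, W_K):
    """Token-to-token attention score matrix W_E W_Q W_K^T W_E^T (vocab x vocab)."""
    return W_E @ W_Q @ W_K.T @ W_E.T


def signed_eigen_fraction(M):
    """Assumed summary statistic: sum of eigenvalues over sum of their magnitudes."""
    eig = np.linalg.eigvals(M)
    return float(eig.sum().real / np.abs(eig).sum())


rng = np.random.default_rng(0)
W_E = rng.normal(size=(12, 16))   # hypothetical (n_vocab, d_model)
W_Q = rng.normal(size=(16, 8))    # hypothetical (d_model, d_head)

# Sanity cases: W_K = W_Q gives a positive semidefinite circuit (ratio near +1),
# and W_K = -W_Q gives a negative semidefinite one (ratio near -1).
score_pos = signed_eigen_fraction(qk_circuit(W_E, W_Q, W_Q))
score_neg = signed_eigen_fraction(qk_circuit(W_E, W_Q, -W_Q))
```

Under this assumed definition, a head whose QK circuit scores close to -1 mostly down-weights tokens that match the query token, which is consistent with the caption's reading of Head 1 as more suppressive than Head 0.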