What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation

Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C. Y. Chan, Andrew M. Saxe

TL;DR

This work investigates the emergence of induction heads (IHs) as key components of in-context learning (ICL) in transformers by introducing a causal, optogenetics-inspired framework that clamps activations during training. It reveals that IHs arise in an additive, redundant fashion with many-to-many wiring between previous-token heads and IHs, and shows that three interacting subcircuits drive the phase change in IH formation. Through activation-level clamping, the authors isolate Subcircuits A (previous-token attend/copy), B (IH QK match), and C (IH copy), and demonstrate that their smooth, co-evolving dynamics explain a data-dependent shift in the timing of IH formation. The study provides open-source tooling and a nuanced mechanistic view of how IHs are learned, offering a framework for predicting learning dynamics from subcircuit formation in small-scale models, with implications for understanding and debugging ICL in larger transformers.

Abstract

In-context learning is a powerful emergent ability in transformer models. Prior work in mechanistic interpretability has identified a circuit element that may be critical for in-context learning -- the induction head (IH), which performs a match-and-copy operation. During training of large transformers on natural language data, IHs emerge around the same time as a notable phase change in the loss. Despite the robust evidence for IHs and this interesting coincidence with the phase change, relatively little is known about the diversity and emergence dynamics of IHs. Why is there more than one IH, and how are they dependent on each other? Why do IHs appear all of a sudden, and what are the subcircuits that enable them to emerge? We answer these questions by studying IH emergence dynamics in a controlled setting by training on synthetic data. In doing so, we develop and share a novel optogenetics-inspired causal framework for modifying activations throughout training. Using this framework, we delineate the diverse and additive nature of IHs. By clamping subsets of activations throughout training, we then identify three underlying subcircuits that interact to drive IH formation, yielding the phase change. Furthermore, these subcircuits shed light on data-dependent properties of formation, such as phase change timing, already showing the promise of this more in-depth understanding of subcircuits that need to "go right" for an induction head.

Paper Structure (33 sections, 11 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: a) Schematic of an induction circuit, involving a previous token head in Layer 1 and an induction head in Layer 2. The side-by-side labels and exemplars in the residual stream after Layer 1 are meant to indicate that information about both is superimposed (perhaps in different subspaces). We highlight the "matching" (green) and "copying" (blue) operations that span the two layers. Historically, focus has been devoted to the "match" operation. One of our key results is to demonstrate the important interactions from the "copy" operation. b) Example training sequences built from the Omniglot dataset and inspired by classical few-shot meta-training. The context consists of two exemplar-label pairs, where the exemplars are from different classes. The query exemplar comes from the same class as one of the exemplars in context. The in-context labels are randomly chosen. Every exemplar can appear with every possible label in every possible position, forcing the transformer to use ICL to minimize the training loss. Validation sequences use either held-out class exemplars or held-out pairs of labels.
  • Figure 2: Example pseudocode demonstrating a pattern preserving ablation using our framework.
  • Figure 3: a) Train and test loss curves. Transformers exhibit strong generalization to unseen classes (orange) and label pairs (green). The loss dynamics reveal a plateau (which may be indicative of a saddle point), where the model is randomly guessing between the two labels present in context (so it has 50% accuracy, instead of the chance level of 20% when there are $L=5$ labels). Then, there's a phase change in the loss which corresponds to the formation of induction circuits, reproducing the finding of InductionHeads. b) Induction head strength for each Layer 2 head plotted over time. Induction head strength is defined as the attention weight given to the correct label token minus that to the incorrect token. All heads appear to have some induction-like behavior, with Head 3 being the strongest and emerging first.
  • Figure 4: a) Effect of various ablations on accuracy (effects on loss are shown in Appendix Figure \ref{fig:additive_heads_loss}). Ablating any single head (triangles) leads to virtually no decrease in task performance, with the exception of Head 3, whose ablation leads to a 1% decrease. Ablating all but a specific head (circles) isolates how useful that specific head is, which correlates well with the induction strength (x-axis), the metric from Figure \ref{fig:induction_heads}b. Importantly, ablating Head 3 (pink triangle) performs very similarly to ablating everything except Head 3 (pink circle), which indicates that the other heads function additively and together can make up for the deletion of Head 3. b) Training loss curves when training from scratch with only a single head from Layer 2 active (and the rest ablated). The black dotted line is the loss profile from the training run in Figure \ref{fig:induction_heads}. Colors are chosen to match Figure \ref{fig:induction_heads}b. Each Layer 2 head on its own can learn to solve the task, though the timing of the phase change shifts and learning is slower.
  • Figure 5: a) Loss dynamics when clamping various variables in the toy model presented in Section \ref{sec:clamping}. Black shows the learning dynamics when no variable is clamped. Only when all other interacting components ($\mathbf{b}, \mathbf{c}$) are clamped does the loss curve become exponential. b) Loss dynamics when clamping various computations outlined in Section \ref{sec:ih_steps}. Black shows the training dynamics of the full network with nothing clamped. c) Induction circuit schematic (from Figure \ref{fig:ih_schematic_methods}a), with computation steps labeled. Arrow colors chosen to illustrate which steps are additionally clamped.
  • ...and 13 more figures
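
The Figure 3b caption defines induction head strength as the attention weight given to the correct label token minus the weight given to the incorrect label token. The following is a minimal sketch of that metric, not the authors' code. It assumes a five-token layout (exemplar, label, exemplar, label, query) inferred from the Figure 1b caption and an attention array of shape [heads, query position, key position]. The function name, argument names, and toy attention values are hypothetical.

```python
import numpy as np


def induction_strength(attn, query_pos, correct_label_pos, incorrect_label_pos):
    """Per-head induction strength, following the Figure 3b definition.

    attn: array of shape [num_heads, seq_len, seq_len]; attn[h, q, k] is the
          attention weight head h assigns from query position q to key position k.
    Returns an array of shape [num_heads].
    """
    attn = np.asarray(attn, dtype=float)
    # Attention from the query token to the correct label, minus attention
    # from the query token to the incorrect label.
    return attn[:, query_pos, correct_label_pos] - attn[:, query_pos, incorrect_label_pos]


# Assumed layout: [exemplar_1, label_1, exemplar_2, label_2, query].
# In this toy case the query matches exemplar_1, so label_1 (position 1)
# is the correct label and label_2 (position 3) is the incorrect one.
num_heads, seq_len = 2, 5
attn = np.zeros((num_heads, seq_len, seq_len))
attn[0, 4] = [0.05, 0.80, 0.05, 0.05, 0.05]  # hypothetical head attending mostly to the correct label
attn[1, 4] = [0.20, 0.20, 0.20, 0.20, 0.20]  # hypothetical head attending uniformly

strengths = induction_strength(attn, query_pos=4, correct_label_pos=1, incorrect_label_pos=3)
print(strengths)
```

Under this definition, a head that attends uniformly scores zero, and a head that puts all of its attention on the correct label scores one, so tracking the value per head over training steps would give curves of the kind described for Figure 3b.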