Out-of-distribution generalization via composition: a lens through induction heads in Transformers

Jiajun Song; Zhuoyan Xu; Yiqiao Zhong

TL;DR

The paper investigates how large language models generalize to out-of-distribution (OOD) prompts, focusing on compositional reasoning within Transformers. Using a synthetic copying task and experiments across multiple pretrained LLMs, it identifies a sharp transition in training: OOD generalization emerges at the same time as subspace alignment between the output circuit of an early attention layer and the input circuit of a later one. The authors summarize this as the common bridge representation (CBR) hypothesis, in which a shared latent subspace connects the writing and reading circuits across layers. Induction heads (IHs) emerge as a central mechanism for composition, as shown on symbolized reasoning tasks and chain-of-thought scenarios. Together, IHs, subspace matching, and the CBR hypothesis offer a mechanistic account of how multilayer attention architectures compose simple operations into more complex reasoning and generalize beyond their training distribution without parameter updates, with implications for interpretability and prompt-driven capabilities in LLMs.

Abstract

Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with only a few demonstrations in the prompt. These tasks require the models to generalize to distributions that differ from the training data, which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examine the training dynamics of Transformers on a synthetic example and conduct extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We find that OOD generalization and composition are tied together: models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.

Paper Structure (72 sections, 15 equations, 27 figures, 3 tables)

Introduction
An exemplar: copying
Compositional structure is integral to OOD generalization
Background
Transformers
Transformer architecture.
Historical development.
Interpreting Transformers
OOD generalization
Dissecting sharp transition in synthetic example
Progress measures
Results
OOD generalization is accompanied by abrupt emergence of subspace matching.
Two layers have complementary specialties: position shifting and token matching.
Memorization vs. generalization: pattern diversity matters.
...and 57 more sections

Figures (27)

Figure 1: Training a two-layer Transformer (TF) and a one-layer TF on the copying task using fresh samples of the format $(*, {\bm{s}}^\#, *, {\bm{s}}^\#, *)$. The models are evaluated on an in-distribution (ID) test dataset and an out-of-distribution (OOD) test dataset. Weak learning phase: the models rely on simple statistics of ID data and fail to generalize on OOD data. Rule-learning phase: the two-layer TF learns the rule of copying from ID data and generalizes well on both ID and OOD data.
Figure 2: Measuring training dynamics for 2-layer 1-head Transformers on the copying synthetic data. First row: Test errors drop abruptly as structural matching occurs. Middle and right plots measure the matching between 1st-layer output circuit (OV) and 2nd-layer input circuit (QK). Second row: Model achieves OOD generalization by learning to compose two functionally distinct components (position matching vs. token matching). Left plot shows the formation of the IH on OOD data. Middle shows PTH scores on completely random tokens devoid of token info. Right shows token matching stripped of positional info.
Figure 3: Left: Memorization vs. generalization: more varied repetition patterns help models learn the copying rule. When the set $\mathcal{S}$ of allowable ${\bm{s}}^\#$ during training has a size smaller than 740, models fail to learn the rule and generalize OOD within 20K steps, yet they can still memorize the patterns if the pool size is small. Right: Composition of two layers expresses the rule of copying. The 1st-layer head shifts the embedding at [B] to [C]. Through the QK-OV circuits, the embedding at [C] then matches the last token [B] in the 2nd-layer attention calculation, which results in attention to [C] and completes the copying task.
Figure 4: LLMs depend on induction heads (IHs) for symbolized language reasoning. (a) We rank attention heads and determine IHs as $K=50$ top-scoring heads. (b) We remove IHs by manually setting attention matrices to zero. (c) We sample instances according to the rule of each task and then construct OOD instances by symbolizing names/labels. (d) We measure the accuracy of various LLMs under IH removal. Symbolized tasks are indicated by blue names. We also report random baseline (deleting $50$ randomly selected heads) using $5$ random seeds, and report the variability using a segment showing $\pm$ one standard deviation. (e) We show the accuracy vs. varying $K$, where a smaller $K$ means deleting fewer heads.
Figure 5: Scaling experiment on fuzzy copying: how removal of induction heads impacts Pythia models. We use the family of Pythia models of varying sizes, ranging from 36M parameters to 7B parameters. As we increase the number of removed induction heads, accuracy on fuzzy copying drops consistently for all models.
...and 22 more figures
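
The sequence format $(*, {\bm{s}}^\#, *, {\bm{s}}^\#, *)$ from Figure 1 and the pattern pool $\mathcal{S}$ from Figure 3 can be made concrete with a small data-generation sketch. This is only an illustration under stated assumptions, not the authors' code: the vocabulary size, segment length, filler lengths, pool size, and the way the OOD segment is drawn (a fresh random segment outside the pool) are all hypothetical choices, since the captions above do not specify them. All function names are likewise invented for this example.

```python
import random


def random_segment(vocab_size, length, rng):
    """Draw a random token segment s# (hypothetical helper)."""
    return tuple(rng.randrange(vocab_size) for _ in range(length))


def make_copy_sequence(segment, vocab_size, max_filler, rng):
    """Build a sequence of the form (*, s#, *, s#, *).

    Each * is a filler of 1..max_filler random tokens (an assumption).
    Returns the token list and the start index of the second copy of s#.
    """
    pre, mid, post = (
        [rng.randrange(vocab_size) for _ in range(rng.randint(1, max_filler))]
        for _ in range(3)
    )
    tokens = pre + list(segment) + mid + list(segment) + post
    second_start = len(pre) + len(segment) + len(mid)
    return tokens, second_start


def copy_targets(tokens, second_start, seg_len):
    """(position, next_token) pairs inside the second copy.

    At these positions the next token is fully determined by the
    copying rule, so they are the natural places to measure error.
    """
    return [
        (t, tokens[t + 1])
        for t in range(second_start, second_start + seg_len - 1)
    ]


rng = random.Random(0)
VOCAB_SIZE, SEG_LEN, MAX_FILLER, POOL_SIZE = 64, 5, 4, 100  # assumed values

# Assumed ID setup: training segments come from a fixed pool S.
pool = {random_segment(VOCAB_SIZE, SEG_LEN, rng) for _ in range(POOL_SIZE)}
id_segment = rng.choice(sorted(pool))

# Assumed OOD setup: a fresh segment that never appears in the pool.
ood_segment = random_segment(VOCAB_SIZE, SEG_LEN, rng)
while ood_segment in pool:
    ood_segment = random_segment(VOCAB_SIZE, SEG_LEN, rng)

id_tokens, id_start = make_copy_sequence(id_segment, VOCAB_SIZE, MAX_FILLER, rng)
ood_tokens, ood_start = make_copy_sequence(ood_segment, VOCAB_SIZE, MAX_FILLER, rng)
ood_pairs = copy_targets(ood_tokens, ood_start, SEG_LEN)
```

A model that has only memorized the pool can predict the targets for `id_tokens` but has no basis for `ood_pairs`, whereas a model that has learned the copying rule can predict both. Shrinking `POOL_SIZE` in this sketch loosely mirrors the pattern-diversity experiment described in Figure 3.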
