Out-of-distribution generalization via composition: a lens through induction heads in Transformers
Jiajun Song, Zhuoyan Xu, Yiqiao Zhong
TL;DR
The paper investigates how large language models generalize to out-of-distribution (OOD) prompts, focusing on composition within Transformers. Using a synthetic copying task and extensive experiments on multiple pretrained LLMs, it identifies a sharp transition at which generalization emerges in tandem with subspace alignment between early and later attention components. Induction heads (IHs) emerge as a central mechanism enabling composition, as demonstrated on symbolic reasoning tasks and chain-of-thought scenarios. The common bridge representation (CBR) hypothesis captures this finding: a latent bridge subspace connects reading and writing circuits across layers, and this shared subspace underpins the ability to compose simple operations into complex reasoning. Overall, the work links IHs, subspace matching, and the CBR hypothesis to explain how multilayer attention architectures can generalize beyond their training distributions without parameter updates, with implications for interpretability and prompt-driven capabilities in LLMs.
Abstract
Large language models (LLMs) such as GPT-4 sometimes appear to be creative, solving novel tasks often with only a few demonstrations in the prompt. These tasks require the models to generalize to distributions that differ from the training data, which is known as out-of-distribution (OOD) generalization. Despite the tremendous success of LLMs, how they approach OOD generalization remains an open and underexplored question. We examine OOD generalization in settings where instances are generated according to hidden rules, including in-context learning with symbolic reasoning. Models are required to infer the hidden rules behind input prompts without any fine-tuning. We empirically examine the training dynamics of Transformers on a synthetic example and conduct extensive experiments on a variety of pretrained LLMs, focusing on a type of component known as induction heads. We find that OOD generalization and composition are tied together: models can learn rules by composing two self-attention layers, thereby achieving OOD generalization. Furthermore, a shared latent subspace in the embedding (or feature) space acts as a bridge for composition by aligning early layers and later layers, which we refer to as the common bridge representation hypothesis.
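To make the idea of composing two operations concrete, the following is a minimal symbolic sketch of the copying rule commonly attributed to induction heads: one step records each position's previous token, and a second step matches the current token against those records and copies the token found there. It is a toy illustration, not the paper's model or experimental setup; the function names `previous_token_map` and `induction_copy` are hypothetical, and the mapping of the two steps to an "early" and a "later" layer is an assumption made for illustration.

```python
from typing import List, Optional


def previous_token_map(tokens: List[str]) -> List[Optional[str]]:
    """Step 1 (assumed early-layer role): each position records the token before it."""
    return [None] + tokens[:-1]


def induction_copy(tokens: List[str]) -> Optional[str]:
    """Step 2 (assumed later-layer role): find the most recent position whose
    recorded previous token equals the last token, and copy the token there."""
    prev = previous_token_map(tokens)
    query = tokens[-1]
    # Scan positions from most recent to oldest, skipping position 0,
    # which has no previous token.
    for pos in range(len(tokens) - 1, 0, -1):
        if prev[pos] == query:
            return tokens[pos]
    # No earlier occurrence of the query token: the rule makes no prediction.
    return None


if __name__ == "__main__":
    # The rule depends only on the repetition pattern, not on token identity,
    # so it applies unchanged to symbols it has never encountered.
    print(induction_copy(["x", "q", "z", "m", "x", "q", "z"]))
    print(induction_copy(["★", "◆", "★"]))
```

Because neither step depends on which symbols appear, the composed rule transfers to arbitrary new tokens, which is the sense in which composition and OOD generalization are tied together in the abstract.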
