How Transformers Get Rich: Approximation and Dynamics Analysis
Mingze Wang, Ruoxi Yu, Weinan E, Lei Wu
TL;DR
This work provides a rigorous framework for understanding how Transformers implement induction heads to enable in-context learning. It formalizes the vanilla induction head (IH_2) and its generalizations (IH_n and GIH_n), and shows that two-layer Transformers with multi-head attention (plus FFNs in the generalized case) can efficiently approximate these mechanisms, with explicit error bounds that depend on the number of heads and the embedding dimension. The dynamics analysis on a synthetic mixed target (a 4-gram plus an in-context 2-gram) reveals a sharp, four-phase transition from lazy n-gram behavior to rich induction-head behavior, driven by time-scale separation between the dot-product (DP) and relative positional encoding (RPE) components of self-attention and by the relative proportions of the two target components. These insights illuminate the internal mechanics behind in-context learning and offer design principles for architectures that aim to leverage induction-head-like capabilities.
Abstract
Transformers have demonstrated exceptional in-context learning capabilities, yet the theoretical understanding of the underlying mechanisms remains limited. Recent work (Elhage et al., 2021) identified a ``rich'' in-context mechanism known as the induction head, in contrast to ``lazy'' $n$-gram models that overlook long-range dependencies. In this work, we provide both approximation and dynamics analyses of how transformers implement induction heads. In the {\em approximation} analysis, we formalize both standard and generalized induction head mechanisms, and examine how transformers can efficiently implement them, with an emphasis on the distinct role of each transformer submodule. For the {\em dynamics} analysis, we study the training dynamics on a synthetic mixed target composed of a 4-gram component and an in-context 2-gram component. This controlled setting allows us to precisely characterize the entire training process and to uncover an {\em abrupt transition} from the lazy (4-gram) mechanism to the rich (induction head) mechanism as training progresses.
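
To make the contrast between the two mechanisms concrete, the following is a minimal sketch of a vanilla induction head acting on a toy sequence. It is an illustration under stated assumptions, not the paper's exact definition: tokens are one-hot vectors, the hypothetical function `induction_head` scores each earlier position $s$ by $x_L^\top W x_{s-1}$ (how well the token preceding $x_s$ matches the current token $x_L$), and the matrix $W = \beta I$ with $\beta = 20$ is an arbitrary choice that makes the softmax nearly hard.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def induction_head(X, W):
    """Toy vanilla induction head (illustrative, hypothetical helper).

    X: (L, d) array whose rows are token embeddings x_1, ..., x_L.
    W: (d, d) matrix used to compare the current token with earlier ones.

    Returns a weighted average of x_s for s = 2, ..., L-1, where the
    weight of x_s is softmax over s of x_L^T W x_{s-1}.
    """
    L = X.shape[0]
    query = X[-1]            # current token x_L
    prev = X[0:L - 2]        # x_{s-1} for s = 2, ..., L-1
    nxt = X[1:L - 1]         # x_s     for s = 2, ..., L-1
    scores = (query @ W) @ prev.T
    weights = softmax(scores)
    return weights @ nxt


# Assumed toy setup: vocabulary of 4 one-hot tokens, sequence 0 1 2 3 0.
vocab_size = 4
tokens = [0, 1, 2, 3, 0]
X = np.eye(vocab_size)[tokens]
W = 20.0 * np.eye(vocab_size)   # beta = 20 is an arbitrary sharpness choice

out = induction_head(X, W)
predicted = int(np.argmax(out))
print(predicted)
```

In this toy sequence the current token 0 previously appeared at the first position and was followed by token 1, so the head places almost all of its weight on token 1. An $n$-gram model with a short fixed window would see only the last few tokens (here 2, 3, 0) and could not use that earlier occurrence.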
