Birth of a Transformer: A Memory Viewpoint
Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou
TL;DR
This work studies how transformers balance global knowledge with in-context information, using a synthetic bigram task that separates persistent world knowledge from context-specific cues. With a simplified two-layer transformer and an associative-memory perspective, the authors show that global bigrams are learned quickly, while an induction head emerges more slowly through top-down gradient steps that tune key-query memories to capture in-context associations. Theoretical analyses link population gradients to memory formation and show how gradient updates can recover useful associations from noisy residual streams. Overall, the study offers a memory-centric view of learning dynamics in transformers, with implications for optimization, data preprocessing, and mechanistic interpretability in language models.
Abstract
Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
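To make the "weight matrices as associative memories" idea concrete, the following is a minimal sketch and not the authors' code. It assumes random Gaussian embeddings in high dimension, which are nearly orthonormal, and stores a set of input-to-output token associations in a single matrix built as a sum of outer products. The dimension, vocabulary size, the `target` mapping, and the `recall` helper are all illustrative choices introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 16  # embedding dimension and vocabulary size (illustrative values)

# Random Gaussian embeddings are nearly orthonormal when d is large.
U = rng.standard_normal((n, d)) / np.sqrt(d)  # input embeddings u_i (rows)
V = rng.standard_normal((n, d)) / np.sqrt(d)  # output embeddings v_j (rows)

# Hypothetical associations to store: input token i -> output token target[i].
target = rng.permutation(n)

# Associative memory as a sum of outer products: W = sum_i v_{target[i]} u_i^T.
W = V[target].T @ U  # shape (d, d)

def recall(i):
    """Return the output token whose embedding best matches W @ u_i."""
    scores = V @ (W @ U[i])  # v_j^T W u_i for every candidate j
    return int(np.argmax(scores))

recalled = np.array([recall(i) for i in range(n)])
accuracy = float(np.mean(recalled == target))
print(f"recall accuracy: {accuracy:.2f}")
```

Because the embeddings are only approximately orthogonal, each lookup returns the stored association plus cross-talk from the other pairs; the cross-talk shrinks as `d` grows relative to the number of stored pairs.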
