Predicting the Formation of Induction Heads

Tatsuya Aoyama, Ethan Gotlieb Wilcox, Nathan Schneider

TL;DR

This work probes how induction heads (IHs) form in transformers during pretraining, linking IH emergence to data properties and training configurations using natural, semi-natural, and synthetic data. A key contribution is a model- and data-agnostic law, $N_{pt} = T\sqrt{BC}$, which predicts the number of training tokens $N_{pt}$ at which IHs emerge from the batch size $B$ and context size $C$, with $T \approx 10^{5.7}$; the law holds across experiments spanning a wide range of scales and is consistent with phase-transition-like behavior. The study also shows that bigram repetition frequency and reliability jointly define a Pareto frontier for IH formation, and that local dependency combined with high repetition frequency and reliability is sufficient for IH formation, while categoriality and the shape of the marginal distribution modulate outcomes near the frontier. Overall, the findings identify concrete data- and configuration-driven mechanisms behind IHs, with implications for understanding and controlling in-context learning in large language models.

Abstract

Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet, a precise characterization of their formation remains unclear. In this study, we investigate the relationship between statistical properties of training data (for both natural and synthetic data) and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect the formation of IHs, and we find a precise Pareto frontier in terms of these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, but when the frequency and reliability are low, categoriality and the shape of the marginal distribution matter.
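As a rough illustration of the law quoted in the TL;DR, $N_{pt} = T\sqrt{BC}$ with $T \approx 10^{5.7}$, the sketch below evaluates the predicted token count for a given batch size and context size. The function names are hypothetical. The conversion to optimizer updates additionally assumes that each update consumes $B \times C$ tokens (batch size counted in sequences), which is not stated in this summary.

```python
import math

# T from the TL;DR; 10**5.7 is roughly 5.0e5. Treat it as approximate.
T_CONSTANT = 10 ** 5.7


def predicted_ih_token_point(batch_size, context_size, t=T_CONSTANT):
    """Predicted number of training tokens at which IHs emerge: T * sqrt(B * C)."""
    if batch_size <= 0 or context_size <= 0:
        raise ValueError("batch_size and context_size must be positive")
    return t * math.sqrt(batch_size * context_size)


def predicted_ih_update_step(batch_size, context_size, t=T_CONSTANT):
    """Same prediction expressed in optimizer updates.

    ASSUMPTION (not from the source): one update consumes
    batch_size * context_size tokens, so updates = tokens / (B * C).
    """
    tokens = predicted_ih_token_point(batch_size, context_size, t)
    return tokens / (batch_size * context_size)


if __name__ == "__main__":
    # Hypothetical configurations, chosen only for illustration.
    for b, c in [(8, 64), (32, 64), (32, 256)]:
        n_pt = predicted_ih_token_point(b, c)
        steps = predicted_ih_update_step(b, c)
        print(f"B={b:>3} C={c:>4}  N_pt ~ {n_pt:.3e} tokens  (~{steps:.0f} updates)")
```

Under this form, quadrupling either the batch size or the context size doubles the predicted token count, while the predicted number of updates halves under the stated tokens-per-update assumption.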

Paper Structure

This paper contains 25 sections, 20 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Developmental trajectories with various batch sizes (left), context sizes (center), and repetitions (right) over the course of 1B tokens of pretraining. BS, CS, and %NR stand for batch size, context size, and the proportion of chunks with no repetitions, respectively.
  • Figure 2: Best score across all heads at the end of training for each frequency-reliability combination. Scores are represented by color, with brighter colors indicating higher scores.
  • Figure 3: Developmental trajectories with various batch sizes (left), context sizes (center), and repetitions (right) over the course of 1B tokens of pretraining. BS, CS, and %NR stand for batch size, context size, and the proportion of chunks with no repetitions, respectively.
  • Figure 4: Smoothed distribution of chunks with various numbers of bigram repetitions. The curves for context sizes 1024 and 2048 were not visible and were therefore removed from the plot. The plot is truncated at y=0.6 for readability, although for context sizes of 4, 8, and 16, more than 95% of chunks had no bigram repetitions.
  • Figure 5: Development of the highest score across all heads at each checkpoint plotted against the number of updates (left column) and (right column). Line colors represent different batch sizes (top row) and context sizes (bottom row). All four plots share the same $x$- and $y$-axis scales. The $x$-axis is in log scale because batch and context sizes increase exponentially, so the number of updates decreases exponentially. The two red dotted lines in each plot mark the column-wise range of the inflection points.
  • ...and 2 more figures
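
The captions above refer to bigram repetitions per chunk and to %NR, the proportion of chunks with no repetitions. A minimal sketch of how such statistics could be computed is given below. It assumes, as a working definition not confirmed by this summary, that a bigram repetition is any occurrence of an adjacent token pair that has already appeared earlier in the same chunk, and that chunks are non-overlapping windows of the context size. All function names are hypothetical.

```python
from collections import Counter


def count_bigram_repetitions(chunk):
    """Count repeated bigram occurrences within one chunk.

    ASSUMPTION: every occurrence of an adjacent pair beyond its first
    occurrence in the chunk counts as one repetition.
    """
    counts = Counter(zip(chunk, chunk[1:]))
    return sum(c - 1 for c in counts.values())


def split_into_chunks(tokens, context_size):
    """Split tokens into non-overlapping chunks, dropping any short tail."""
    last_start = len(tokens) - context_size
    return [tokens[i:i + context_size] for i in range(0, last_start + 1, context_size)]


def proportion_no_repetition(tokens, context_size):
    """Fraction of chunks containing no repeated bigram (an analogue of %NR)."""
    chunks = split_into_chunks(tokens, context_size)
    if not chunks:
        raise ValueError("token sequence is shorter than context_size")
    no_rep = sum(1 for chunk in chunks if count_bigram_repetitions(chunk) == 0)
    return no_rep / len(chunks)


if __name__ == "__main__":
    tokens = list("ababcdef")  # toy token sequence for illustration
    for cs in (2, 4, 8):
        print(cs, proportion_no_repetition(tokens, cs))
```

On the toy sequence, very short chunks cannot contain a repeated bigram at all, which matches the direction of the Figure 4 note that the smallest context sizes consist almost entirely of chunks with no repetitions.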