The emergence of sparse attention: impact of data distribution and benefits of repetition

Nicolas Zucchet, Francesco d'Angelo, Andrew K. Lampinen, Stephanie C. Y. Chan

TL;DR

This work investigates when sparse attention emerges during Transformer training and how data distribution and repetition shape that timing. It develops a tractable attention-based linear-regression toy model to derive explicit learning dynamics, revealing a plateau followed by abrupt emergence, with the plateau length scaling as a power law in sequence length and dimension. It also shows that repetition accelerates emergence, with formulas linking burstiness and repetition probability to shorter plateaus. The theory is extended to more realistic Transformers and an in-context associative recall task, demonstrating data-driven predictions of emergence speed and highlighting the role of sparse attention in in-context learning. The findings offer a unifying perspective on emergence phenomena and suggest practical active-learning strategies that modulate data diversity to optimize learning trajectories.

Abstract

Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.

Paper Structure

This paper contains 44 sections, 50 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: A simple task to study the emergence of sparse attention. (left) We introduce a variant of the linear regression task that is analytically tractable and in which Transformer-like models need to learn sparse attention. The model must identify which token (here the last one, $x_T$) is relevant for the target output $y^*$. We incorporate two realistic forms of repetition in the data: in-context repetition, where the relevant token appears multiple times within the context, and cross-sample repetition, where an input sequence contains a special token $\tilde{x}$ (here colored in green) at the relevant position with probability $p$. See Section \ref{subsec:task} for details. (right) a. As desired, the reduced learning dynamics of our simplified Transformer (Eq. \ref{eq:simplified_attention_model}) exhibit multi-phase behavior, including an initial plateau, on the task without repetition ($T = 512$). b. Mechanistically, the weights $w$ begin learning before the attention to the relevant token, $\alpha$, does ($T = 512$, $d=64$). Dashed lines represent optimal values. c. The duration of the initial plateau increases as a function of sequence length $T$ and input/output dimension $d$, closely following a power-law scaling relationship ($R^2 = 0.999$) that can be accurately predicted by linearizing the dynamics around initialization (Eq. \ref{eqn:emergence_time}). See Section \ref{subsec:analysis_norep} for details.
  • Figure 2: Repetition speeds up emergence in the linear regression task in a theoretically predictable way. (left) Increasing in-context repetition through $B$ reduces the initial plateau, and the length of the plateau is well captured by the power law $T_\mathrm{plateau} = 1.51 \, T^{\,0.99} \, B^{-0.99} \, d^{\,0.49}$ ($R^2 = 0.999$). (right) Cross-sample repetition, modulated by the repetition probability $p$, exhibits similar effects, even when evaluating the model on a test loss without repetition (i.e., $p=0$). The length of the plateau follows $T_\mathrm{plateau} = 2.15 \, (\sqrt{d}\,T / \sqrt{p^2 d + (1 - p)^2})^{1.02}$ ($R^2 = 0.992$). The plateau length in both cases closely follows theoretical predictions. See Section \ref{subsec:analysis_rep} for details.
  • Figure 3: Theoretical insights on the linear regression task transfer to more realistic versions of the task, the model, and the optimizer. Transformer-based architectures exhibit phase transitions similar to those of our toy model, and their plateau lengths follow trends similar to those derived in theory. (left and middle) Evolution of the plateau length as a function of $d$, when varying $T$ and $B$ (by default $T=256$ and $B=1$). The lines correspond to the power law $T_\mathrm{plateau} = 0.76 \, d^{\,1.29} \, T^{\, 0.80} \, B^{\,-0.80}$ ($R^2 = 0.995$). (right) Same plot, this time varying the cross-sample repetition probability $p$. The lines correspond to the evolution of the average plateau length as $d$ increases, for the different $p$ values. See Section \ref{subsec:validation_theory_transformer} for details.
  • Figure 4: In-context learning emerges in the associative recall task, and emergence time grows with the number of pairs in the context and the vocabulary size. (left and middle) The ability to solve the task emerges through training, with emergence time increasing as the task gets harder (here by increasing the number of pairs, with the vocabulary size fixed to $256$). This is qualitatively consistent with the theory developed for sparse attention. (right) Systematically investigating the relationship between emergence time and data properties reveals that it follows the power law $T_\mathrm{plateau} = 0.55 \, N_\mathrm{tokens}^{0.79}\, N_\mathrm{pairs}^{2.25}$ ($R^2 = 0.982$). Results are only shown when the number of pairs is smaller than the vocabulary size, to ensure that the query does not appear multiple times in the context (we do not consider repetition here). More details, in particular a hypothesis for why the $N_\mathrm{pairs}$ exponent is so high, can be found in Section \ref{subsec:ass_recall_results}.
  • Figure 5: Repetition speeds up emergence in the in-context associative recall task but comes with overfitting risks. (left) We vary the amount of in-context repetition $B$, that is, the average number of times the query appears as a key in the context, and find significant benefits from small amounts of repetition. Larger amounts of repetition lead to overfitting, but training for long enough eventually leads to grokking. (right) Cross-sample repetition, more precisely the probability $p$ that the query is one of the $2$ repeated tokens, has similar effects. Results are obtained for $N_\mathrm{pairs} = 32$ and $N_\mathrm{tokens} = 256$.
  • ...and 9 more figures
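
The fitted power laws quoted in the captions of Figures 2 to 4 can be evaluated directly to get a feel for how the plateau length responds to sequence length, dimension, and repetition. The sketch below is illustrative only: the function names are hypothetical, the constants and exponents are copied from the captions, and the captions do not state the unit of $T_\mathrm{plateau}$, so the returned values should be read as relative quantities rather than exact step counts.

```python
import math


def plateau_in_context(T: float, B: float, d: float) -> float:
    """Figure 2 (left) fit for the toy model: 1.51 * T^0.99 * B^-0.99 * d^0.49."""
    return 1.51 * T**0.99 * B ** (-0.99) * d**0.49


def plateau_cross_sample(T: float, p: float, d: float) -> float:
    """Figure 2 (right) fit: 2.15 * (sqrt(d) * T / sqrt(p^2 * d + (1 - p)^2))^1.02."""
    base = math.sqrt(d) * T / math.sqrt(p**2 * d + (1 - p) ** 2)
    return 2.15 * base**1.02


def plateau_transformer(T: float, B: float, d: float) -> float:
    """Figure 3 fit for Transformers: 0.76 * d^1.29 * T^0.80 * B^-0.80."""
    return 0.76 * d**1.29 * T**0.80 * B ** (-0.80)


def plateau_assoc_recall(n_tokens: float, n_pairs: float) -> float:
    """Figure 4 fit for associative recall: 0.55 * N_tokens^0.79 * N_pairs^2.25."""
    return 0.55 * n_tokens**0.79 * n_pairs**2.25


if __name__ == "__main__":
    # Settings taken from the captions (T = 512, d = 64); B and p are swept here
    # purely for illustration.
    T, d = 512, 64
    for B in (1, 2, 4):
        print(f"B={B}: toy-model plateau ~ {plateau_in_context(T, B, d):.0f}")
    for p in (0.0, 0.5, 1.0):
        print(f"p={p}: toy-model plateau ~ {plateau_cross_sample(T, p, d):.0f}")
```

Under these fits, doubling $B$ shortens the toy-model plateau by a factor of $2^{0.99}$, and setting $p = 1$ removes the $\sqrt{d}$ factor from the cross-sample expression, leaving $2.15 \, T^{1.02}$.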