Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws
Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Yizhou Xu, Florent Krzakala, Lenka Zdeborová
TL;DR
The paper develops a high-dimensional theory of single-head tied attention trained by empirical risk minimization (ERM), linking the spectra of the learned weights to generalization. It maps ERM in attention to a generalized matrix sensing problem and solves it with approximate message passing, obtaining exact asymptotics for training and test errors, interpolation and recovery thresholds, and the spectrum of the learned query-key map. The spectrum consists of a structured bulk plus spectral outliers that encode learned features, which gives a direct quantitative link between spectral structure and generalization. For targets with heavy-tailed spectra, the work also uncovers power-law scaling laws, showing sequential spectral recovery and universal exponents and offering a principled explanation for emergence and scaling phenomena observed in transformers. Although the analysis rests on simplifying assumptions, the results reproduce key qualitative phenomena and pave the way for extensions to more realistic data distributions and architectures.
Abstract
Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.
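To make the objects named in the abstract concrete, the following is a minimal NumPy sketch of a single-head tied-attention layer and of the spectrum of its query-key map. It is an illustration under stated assumptions, not the paper's setup: the function names `tied_attention` and `query_key_spectrum`, the sizes `T`, `d`, `r`, the Gaussian data and weights, and the `1/sqrt(r)` score scaling are all hypothetical choices made here.

```python
import numpy as np


def tied_attention(X, W):
    """Row-stochastic attention matrix when queries and keys share weights W.

    X: (T, d) token embeddings; W: (d, r) shared query/key weights.
    The 1/sqrt(r) scaling is an illustrative assumption.
    """
    Q = X @ W                                   # queries and keys coincide
    scores = Q @ Q.T / np.sqrt(W.shape[1])      # (T, T) symmetric scores
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=1, keepdims=True)     # softmax over each row


def query_key_spectrum(W):
    """Eigenvalues of S = W W^T, sorted in decreasing order.

    S is symmetric positive semidefinite, so these are also its singular values.
    """
    return np.linalg.eigvalsh(W @ W.T)[::-1]


rng = np.random.default_rng(0)
T, d, r = 8, 64, 4                              # hypothetical sizes
X = rng.standard_normal((T, d)) / np.sqrt(d)    # synthetic Gaussian tokens
W = rng.standard_normal((d, r))                 # untrained random weights
A = tied_attention(X, W)
spectrum = query_key_spectrum(W)
print(A.shape, spectrum[:r])
```

In this toy, `W` is random rather than trained, so the spectrum only shows the trivial structure of a rank-`r` map: at most `r` nonzero values followed by numerical zeros. The bulk and outliers discussed in the abstract concern the spectrum after training, which this sketch does not compute.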
