Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks
Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, Lenka Zdeborova
TL;DR
The paper analyzes SGD dynamics for Sequence Single-Index (SSI) models, which generalize single-index learning to sequences and cover simplified one-layer attention architectures. It introduces the Sequence Information Exponent (SIE) via Hermite expansions and shows that the population loss $R(w)$ depends only on the sufficient statistics $(e, m)$. This yields sharp sample-complexity scalings for SGD: $\mathcal{O}_L(d)$ for $\text{SIE}=1$, $\mathcal{O}_L(d\log^2 d)$ for $\text{SIE}=2$, and $\mathcal{O}_L(d^{\text{SIE}-1})$ for $\text{SIE}\ge 3$. Positional encoding can reduce the SIE and thereby potentially accelerate learning. The work also contrasts tied (linear attention) and untied networks to quantify a speedup driven by sequence length, deriving a gain bound that scales with $L$ under favorable structure, and presents a phase diagram in which SGD converges to semantic or positional minima depending on the encoding and the target structure. These results give a rigorous, interpretable framework for understanding how sequential structure and positional encoding influence learning with attention-like models. Overall, the paper connects the theory of single- and multi-index models with sequence-attention architectures, offering precise predictions for sample complexity, convergence rates, and optimization landscapes in high dimensions.
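To make the role of Hermite expansions concrete, the sketch below computes the classical information exponent of a scalar link function, i.e. the index of its first non-zero Hermite coefficient of degree at least 1, and maps an exponent to the scalings quoted above. This is an illustration under assumptions: the summary does not give the definition of the SIE for sequence models, so the scalar single-index analogue is used here, and the function names (`hermite_coefficients`, `information_exponent`, `sample_scaling`) are hypothetical.

```python
import math

import numpy as np
from numpy.polynomial import hermite_e as He


def hermite_coefficients(g, max_degree=6, quad_points=60):
    """Return c_k = E[g(z) He_k(z)] / k! for z ~ N(0, 1), k = 0..max_degree."""
    # Gauss-Hermite nodes and weights for the weight function exp(-x^2 / 2).
    x, w = He.hermegauss(quad_points)
    w = w / math.sqrt(2.0 * math.pi)  # normalise so the weights sum to 1
    gx = g(x)
    coeffs = []
    for k in range(max_degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0  # coefficient vector selecting He_k
        hk = He.hermeval(x, basis)
        coeffs.append(float(np.sum(w * gx * hk)) / math.factorial(k))
    return np.array(coeffs)


def information_exponent(g, max_degree=6, tol=1e-8):
    """Smallest k >= 1 with a non-zero Hermite coefficient (scalar analogue,
    assumed here; not the paper's sequence-level definition)."""
    c = hermite_coefficients(g, max_degree)
    for k in range(1, max_degree + 1):
        if abs(c[k]) > tol:
            return k
    return None  # no non-zero coefficient found up to max_degree


def sample_scaling(sie):
    """Sample-size scaling quoted in the TL;DR for a given exponent."""
    if sie == 1:
        return "O_L(d)"
    if sie == 2:
        return "O_L(d log^2 d)"
    return f"O_L(d^{sie - 1})"
```

For example, `information_exponent(np.tanh)` looks at a link with a non-zero linear component, while `information_exponent(lambda z: z**3 - 3 * z)` looks at a pure third-degree Hermite polynomial, whose lower-order coefficients vanish.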
Abstract
We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases (escape from an uninformative initialization, followed by alignment with the target subspace) and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.
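The idea of tracking a low-dimensional statistic along the SGD trajectory can be illustrated on a toy problem. The sketch below is not the SSI model or an attention network: it assumes a classical single-index teacher and student with a `tanh` link, Gaussian inputs, one fresh sample per step, and a weight vector renormalized to the unit sphere after each update. It records the overlap between the student and teacher directions, which plays the role of an alignment statistic. All names and hyperparameters are hypothetical choices for illustration.

```python
import numpy as np


def run_online_sgd(d=100, steps=5000, lr_scale=0.5, seed=0):
    """Online spherical SGD on a toy single-index model (illustrative assumption).

    Returns the overlap w . w_star after every step, starting from a random,
    essentially uninformative initialization.
    """
    rng = np.random.default_rng(seed)

    # Teacher direction and random student initialization, both on the unit sphere.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)

    lr = lr_scale / d  # step size scaled with the dimension
    overlaps = np.empty(steps + 1)
    overlaps[0] = w @ w_star

    for t in range(steps):
        x = rng.standard_normal(d)       # one fresh sample per step
        y = np.tanh(x @ w_star)          # teacher label
        f = np.tanh(x @ w)               # student prediction
        # Gradient of the squared loss 0.5 * (f - y)^2 with respect to w.
        grad = (f - y) * (1.0 - f**2) * x
        w = w - lr * grad
        w /= np.linalg.norm(w)           # project back onto the sphere
        overlaps[t + 1] = w @ w_star

    return overlaps
```

Calling `run_online_sgd()` and plotting the returned array against the step index shows the overlap leaving its small initial value and then approaching alignment with the teacher. Because this toy link has a non-zero linear Hermite component, the escape is fast; it should not be read as a reproduction of the paper's results.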
