Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality

Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

TL;DR

The paper addresses the theoretical understanding of in-context learning (ICL) by analyzing the gradient-flow dynamics of a one-layer, multi-head softmax attention model applied to multi-task linear regression. By mapping parameter dynamics to spectral dynamics and introducing decomposable weights, the authors prove global convergence, characterize a three-phase learning process with emergent task specialization across heads, and establish optimality results showing that the gradient-flow solution matches the best possible multi-head configuration up to a constant factor. They further show that multi-head attention yields a strict advantage over single-head models and identify an exponential regime and a saturation regime in the softmax mechanism, with regularization effects that favor delocalized attention in noisy settings. Extensions include connections to linear attention, length generalization, and the limits of transfer to nonlinear tasks. Overall, the work provides the first convergence results for multi-head softmax attention in the ICL context and clarifies how spectral dynamics underpin task allocation and performance bounds.

Abstract

We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression. We establish the global convergence of gradient flow under suitable choices of initialization. In addition, we prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics, where each attention head focuses on solving a single task of the multi-task model. Specifically, we prove that the gradient flow dynamics can be split into three phases: a warm-up phase where the loss decreases rather slowly and the attention heads gradually build up their inclination towards individual tasks, an emergence phase where each head selects a single task and the loss rapidly decreases, and a convergence phase where the attention parameters converge to a limit. Furthermore, we prove the optimality of gradient flow in the sense that the limiting model learned by gradient flow is on par with the best possible multi-head softmax attention model up to a constant factor. Our analysis also delineates a strict separation in terms of the prediction accuracy of ICL between single-head and multi-head attention models. The key technique for our convergence analysis is to map the gradient flow dynamics in the parameter space to a set of ordinary differential equations in the spectral domain, where the relative magnitudes of the semi-singular values of the attention weights determine task allocation. To the best of our knowledge, our work provides the first convergence result for the multi-head softmax attention model.

Paper Structure

This paper contains 142 sections, 43 theorems, 549 equations, 7 figures, and 10 tables.

Key Result

Theorem 1 (Gradient flow for MSA, informal)

Under proper initialization of the gradient flow (also satisfying the decomposability property), assume that the number of heads is larger than the number of tasks. Then the following holds:

Figures (7)

  • Figure 1: Illustration of the training loss and dynamics of the weights of the MSA. The dynamics undergo three phases: warm-up, emergence, and convergence. Here, we set $H=3$ heads, $I=3$ homogeneous tasks with signal strength $\lambda=1$ and no noise, dimension $d=30$ and context sequence length $L=100$. The top two plots show the dynamics of the eigenvalues of two combined weight matrices, $\bar{\omega}^{(h)}$ for $W_X^{(h)}$ and $\mu^{(h)}$ for $U_Y^{(h)}$ for all $h\in[H]$. The third plot shows the training loss and the last plot shows the value of $\bar{\omega}_{i}^{(h)}$ upon convergence. Here, the optimal head for task $i$ is $h_i^\star = i$ for $i\in[I]$ by initialization.
  • Figure 2: Illustration of the MSA with $H$ heads, embedding dimension $d_e$, and sequence length $L$. Weights $(V^{(h)}, K^{(h)}, Q^{(h)})$ can be split into the $X$ and $Y$ parts. The combined weights $(U^{(h)}, W^{(h)})$ are also split into $(U_X^{(h)}, U_Y^{(h)})$ and $(W_X^{(h)}, W_Y^{(h)})$. Under the decomposability assumption ($W_Y^{(h)} = 0$, $U_X^{(h)} = 0$), the attention scores only depend on the $X$ part, and the output is an aggregation of the $Y$ part. Although we introduce the concept of combined weights, it is important to note that the gradient flow operates on the individual weights.
  • Figure 3: Approximated dynamics of the gradient flow with $d_i = 10$, $d_e = 30$, $L=100$, $\phi_i =1$ and $\lambda_i =1$. In (a) we plot the dynamics of $\mu_i^\star$ in \ref{eq:approximated_dyn_main3}, starting from $t =4$ and $(\bar{\omega}_{i}^{\star}, \mu_{i}^{\star}) = (0.01, 1)$. In (b) we plot the dynamics of $\bar{\omega}_{i}^{\star}$ in \ref{eq:emergence}, starting from $t =4$ and $\bar{\omega}_{i}^{\star} = 0.01$. We note that the approximated dynamics given by \ref{eq:approximated_dyn_main3} and \ref{eq:emergence} are consistent with the numerical simulations in \ref{fig:dyn} based on data. In (c) we plot $\partial_t\bar{\omega}_{i}^{\star}$ as a function of $\bar{\omega}_{i}^{\star}$ with the same setup. When $\bar{\omega}_{i}^{\star}$ is small (left), the growth of $\partial_t\bar{\omega}_{i}^{\star}$ is slow. When $\bar{\omega}_{i}^{\star}$ reaches the middle region, the growth of $\bar{\omega}_{i}^{\star}$ accelerates. When $\bar{\omega}_{i}^{\star}$ grows to the right end, it converges to $d_i^{-1/2}$.
  • Figure 4: Comparison between the lower bound LB-ICL and the Bayesian risk (MMSE) for different $H$ and $\overline{d}/L$ with $I=10$, $\lambda =\sigma^2=1$. Here, $\overline{d} = d/I$.
  • Figure 5: Comparison of the ICL loss between (i) ridge regression with regularization parameter $\mathrm{penalty} = 0.05$ and $\mathrm{penalty} = 5$; (ii) the Bayesian risk, i.e., MMSE; (iii) the ICL loss for the convergence point of the gradient flow \ref{eq:icl-loss-convergence-point} with $H\ge I$; (iv) the ICL loss of the optimal single-head attention \ref{eq:water filling}. Here, $\overline{d} = d/I$. For large $\overline{d}/L$, the GF-ICL loss behaves similarly to ridge regression with large $\mathrm{penalty}=5$ and does not have the double descent phenomenon that is observed with insufficient regularization $\mathrm{penalty}=0.05$. This is due to the softmax-induced regularization as we will further illustrate in §\ref{sec:implicit_regularization}. For small $\overline{d}/L$, we see that MSA (GF-ICL) has an $H\land I=10$ improvement over the single-head attention (OptS-ICL). We refer readers to §\ref{sec:icl loss comparison} for more details.
  • ...and 2 more figures
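
The decomposable structure described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it only assumes what the caption states, namely that with $W_Y^{(h)} = 0$ and $U_X^{(h)} = 0$ the attention scores depend on the $X$ part alone and the output aggregates the $Y$ part. The function names, the array shapes, and the choice to sum the head outputs are assumptions made for illustration.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    weights = np.exp(shifted)
    return weights / weights.sum()


def decomposable_msa(X, Y, x_query, W_X, U_Y):
    """One-layer multi-head softmax attention with W_Y = 0 and U_X = 0.

    Shapes (illustrative assumptions, not taken from the paper):
      X: (L, d) context covariates      Y: (L, d_y) context responses
      x_query: (d,) query covariate
      W_X: (H, d, d) per-head score weights
      U_Y: (H, d_y, d_y) per-head output weights
    Returns a (d_y,) prediction for the query response.
    """
    y_hat = np.zeros(Y.shape[1])
    for W_h, U_h in zip(W_X, U_Y):
        # Scores use only the X part: s_l = x_l^T W_h x_query.
        scores = X @ W_h @ x_query
        attn = softmax(scores)
        # Output aggregates only the Y part, weighted by attention.
        y_hat += U_h @ (Y.T @ attn)
    return y_hat
```

As a sanity check on the sketch, setting every `W_X[h]` to zero makes the attention uniform, so a single head with `U_Y[0]` equal to the identity returns the plain average of the context responses.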

Theorems & Definitions (87)

  • theorem 1: Gradient flow for MSA, informal
  • definition 1
  • lemma 1: Preservation of Decomposability along Gradient Flow
  • theorem 2: Convergence of gradient flow for MSA
  • lemma 2: ICL Loss for MSA
  • theorem 3: Optimal Loss for Single-Head Softmax Attention
  • definition 2: Equiangular Weights
  • theorem 4: Lower Bound of MSA within Equiangular Weights
  • corollary 1
  • proposition 1: Length Generalization for MSA
  • ...and 77 more