
Homogenized Transformers

Hugo Koubbi, Borjan Geshkovski, Philippe Rigollet

Abstract

We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker--Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
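The discrete-time interacting particle system described above can be sketched in a few lines of NumPy: tokens live on the unit sphere, each layer draws fresh Gaussian query/key/value weights for every head, and the residual update of size $\eta$ is projected back onto the sphere. The specific parametrization below (softmax attention with inverse temperature $\beta$, $1/\sqrt{d}$ weight scaling, renormalization after the residual step) is our illustrative assumption, not the paper's exact model.

```python
import numpy as np

def random_attention_step(X, n_heads, eta, beta, rng):
    """One layer of random multi-head self-attention on the sphere.

    X: (n, d) array of tokens on S^{d-1}. Weights are resampled
    independently for each head, as at initialization of training.
    Head outputs are averaged, scaled by the residual step eta, and
    the result is projected back onto the sphere. Illustrative sketch
    only; the paper's exact scalings and normalizations may differ.
    """
    n, d = X.shape
    update = np.zeros_like(X)
    for _ in range(n_heads):
        # Fresh Gaussian query/key/value weights for this head.
        Q = rng.standard_normal((d, d)) / np.sqrt(d)
        K = rng.standard_normal((d, d)) / np.sqrt(d)
        V = rng.standard_normal((d, d)) / np.sqrt(d)
        scores = beta * (X @ Q) @ (X @ K).T          # (n, n) attention logits
        A = np.exp(scores - scores.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)            # row-wise softmax
        update += A @ X @ V.T / n_heads              # averaged head output
    Y = X + eta * update                             # residual connection
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # back to S^{d-1}

rng = np.random.default_rng(0)
n, d, L = 8, 16, 50
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(L):                                   # depth plays the role of time
    X = random_attention_step(X, n_heads=4, eta=0.05, beta=1.0, rng=rng)
```

Running the loop for $L$ layers with step size $\eta$ traces the residual stream up to the macroscopic time $t_L=\eta L$; the joint scalings of $L$, $\eta$, and the number of heads are what the homogenization result organizes.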

Paper Structure

This paper contains 40 sections, 25 theorems, 362 equations, 4 figures.

Key Result

Theorem 1

Under Assumption \ref{ass:high_order_short}, for any $\varphi\in C^4((\mathbb{S}^{d-1})^n)$ the following approximation holds uniformly up to the macroscopic time $t_L=\eta L$: where $C\geqslant 1$ depends on $\|\varphi\|_{C^4}$ but not on $\eta$, $\alpha$, $L$. $\blacktriangleleft$

Figures (4)

  • Figure 1: Qualitative phase diagrams in the $(\eta,\alpha)$-plane for fixed depth $L$ (left), together with the same picture in logarithmic coordinates (right), where straight lines correspond to power-law scalings of $\eta$ and $\alpha$ with respect to $L$. Different regions correspond to the rigorous regimes obtained from Theorem \ref{thm:weak_error_clean}. Specifically, "ODE I" (Corollary \ref{cor:ode1}) yields \ref{eq: deterministic} in the ballistic regime $\alpha\eta L=o(1)$ and $\eta^2L=o(1)$; "ODE II" (Corollary \ref{cor:ode2}) yields \ref{eq: deterministic.modified} in the refined deterministic regime $\alpha\eta L=o(1)$ and $\eta^3L=o(1)$; and "SDE" (Corollary \ref{cor:SDE}) yields the homogenized SDE \ref{eq:SDE_ito_clean} in the diffusive regime $\alpha\eta L=O(1)$ and $\eta^3L=o(1)$. The region labelled "Static" corresponds to $\eta L=o(1)$, so that no non-trivial macroscopic evolution is seen on the time scale $t_L=\eta L$, whereas "Failure of approximation" indicates scalings not covered by the present weak-approximation argument. We study the qualitative behavior of the homogenized model in the setting of centered Gaussian weights, for which only the diffusive regime yields non-stationary evolution---see \ref{eq:Diffusive_gaussian_case}.
  • Figure 2: Comparison between the curve $\gamma$ in Theorem \ref{thm:clustering_random_init} (left) and that of \cite{geshkovski2025mathematical} for $d=128$ (right).
  • Figure 3: Theorem \ref{thm:clustering_small_beta} asserts that the second moment approaches the solution of the logistic equation \ref{eq:logistic_u_smallbeta} when $d$ is large and $\upbeta$ is small. Left: $n=200$ and $\upbeta=0.05$. Right: $n=5000$ and $\upbeta=0.001$.
  • Figure 4: Realizations of solutions to \ref{eq: sde.logistic} in Theorem \ref{thm:large_beta_meta}.
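Figure 3 compares the tokens' second moment against a logistic equation. As a hedged sketch, the generic logistic dynamics can be integrated with a forward Euler scheme as below; the coefficient `c` is a placeholder, not the paper's exact constant from eq:logistic_u_smallbeta.

```python
import numpy as np

def logistic_ode(u0, c, t_max, dt):
    """Forward Euler integration of the generic logistic equation
    u' = c * u * (1 - u).

    The coefficient c is a placeholder; the paper's logistic equation
    for the second moment may carry different constants depending on
    d, n, and the temperature.
    """
    steps = int(t_max / dt)
    u = np.empty(steps + 1)
    u[0] = u0
    for k in range(steps):
        u[k + 1] = u[k] + dt * c * u[k] * (1.0 - u[k])
    return u

traj = logistic_ode(u0=0.1, c=1.0, t_max=10.0, dt=0.01)
# For 0 < u0 < 1 and c > 0, the trajectory increases monotonically
# toward the stable fixed point u = 1 (full clustering in this reading).
```

Under this generic form, the saturation of the trajectory at $u=1$ mirrors the representation-collapse behavior that the quantitative trade-offs in the abstract are designed to control.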

Theorems & Definitions (54)

  • Theorem 1
  • Remark 2.1: Multi-layer perceptrons
  • Corollary 1
  • Remark 2.2
  • Theorem 2
  • Remark 2.3
  • Theorem 3
  • Theorem 4
  • Lemma 1
  • Remark 3.1
  • ...and 44 more