A mathematical perspective on Transformers

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet

TL;DR

This work presents a mathematical framework for Transformers by viewing self-attention as a mean-field interacting particle system on the unit sphere, and studies its long-time clustering and metastability through continuity equations, energy monotonicity, and gradient-flow perspectives. It connects token dynamics to Wasserstein gradient flows and established models of collective behavior, delivering rigorous results across small/large temperature regimes and high-dimensional settings, while outlining extensions toward full Transformer architectures and open questions. The analysis weaves together particle and measure viewpoints, detailing when and how a single cluster forms, and exposing rich phenomena such as metastability and phase transitions that echo practical observations in large language models. By offering a principled mathematical lens, the paper lays groundwork for further theoretical exploration of attention-driven dynamics and their implications for token organization and model behavior.

Abstract

Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.

Paper Structure

This paper contains 40 sections, 17 theorems, 195 equations, 6 figures, 1 table.

Key Result

Proposition 3.4

Let $\beta>0$ and $d\geq 2$. The unique global minimizer of $\mathsf{E}_\beta$ over $\mathcal{P}(\mathbb{S}^{d-1})$ is the uniform measure $\sigma_d$ (that is, the Lebesgue measure on $\mathbb{S}^{d-1}$, normalized to be a probability measure). Any global maximizer of $\mathsf{E}_\beta$ over $\mathcal{
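
The statement above can be probed numerically on point clouds. The following is a minimal sketch, not the paper's code: it assumes the interaction energy takes the form $\mathsf{E}_\beta[\mu]=\frac{1}{2\beta}\iint e^{\beta\langle x,x'\rangle}\,d\mu(x)\,d\mu(x')$, which is not spelled out in this excerpt, and evaluates it on empirical measures. The names `empirical_energy` and `random_sphere_points` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def empirical_energy(x, beta):
    """Energy of the empirical measure of the rows of x (shape (n, d), unit norm).

    Assumed form (not quoted from the excerpt):
        (1 / (2 * beta * n^2)) * sum_{i,j} exp(beta * <x_i, x_j>)
    """
    n = x.shape[0]
    gram = x @ x.T  # all pairwise inner products, including i == j
    return np.exp(beta * gram).sum() / (2.0 * beta * n**2)

def random_sphere_points(n, d, rng):
    """Draw n points uniformly on the unit sphere by normalizing Gaussians."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, beta = 64, 3, 2.0

spread = random_sphere_points(n, d, rng)   # roughly uniform point cloud
clustered = np.tile(spread[0], (n, 1))     # all points collapsed to one location

e_spread = empirical_energy(spread, beta)
e_clustered = empirical_energy(clustered, beta)
print(f"spread: {e_spread:.4f}  clustered: {e_clustered:.4f}")
```

Under the assumed form, a fully collapsed configuration attains $e^{\beta}/(2\beta)$, the largest value any point cloud on the sphere can reach since every inner product is at most $1$, while a spread-out cloud gives a strictly smaller value.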

Figures (6)

  • Figure 1: Histogram of $\{\langle x_i(t),x_j(t)\rangle\}_{(i,j)\in[n]^2, i\neq j}$ at different layers $t$ in the context of the trained ALBERT XLarge v2 model (lanalbert and https://huggingface.co/albert-xlarge-v2), which has constant parameter matrices. Here we randomly selected a single prompt, which in this context is a paragraph from a random Wikipedia entry, and then generated the histogram of the pairwise inner products. We see the progressive emergence of clusters all the way to the $24$th (and last) hidden layer (top), as evidenced by the growing mass at $1$. If the number of layers is increased, up to 48 say, the clustering is further enhanced (bottom).
  • Figure 2: Green zones indicate regimes where convergence to a single cluster as $t\to+\infty$ can be proven. Here $n\geq 2$ is fixed. When $d$ is larger than specific thresholds, the long-time asymptotics can be chiseled out in finer detail. Convergence is slow when $\beta\gg1$ (relative to the size of $d, n$), as even the exponential decay constant when $d\geq n$ is of the form $\lambda = O(e^{-\beta})$ and thus degenerates. One rather expects dynamic metastability. Section \ref{sec: temperature} addresses the case where $\beta$ is small and Section \ref{sec: large.beta} where it is large, whereas Section \ref{sec: high.d} covers the high-dimensional case at arbitrary $\beta$.
  • Figure 3: Plots of the probability that randomly initialized particles following \ref{SA} cluster to a single point as a function of $t$ and $\beta$: we graph the function $(t,\beta)\mapsto \mathbb{P}_{(x_1(0),\ldots,x_n(0))\sim\sigma_d}\left(\{\langle x_1(t),x_2(t)\rangle\geq1-\delta\}\right)$, which is equal to $(t,\beta)\mapsto \mathbb{P}_{(x_1(0),\ldots,x_n(0))\sim\sigma_d, i\neq j \text{ fixed}}\left(\{\langle x_i(t),x_j(t)\rangle\geq1-\delta\}\right)$ by permutation equivariance. We compute this function by generating the average of the histogram of $\{\langle x_i(t),x_j(t)\rangle\geq1-\delta\colon(i,j)\in[n]^2, i\neq j\}$ over $2^{10}$ different realizations of initial sequences. Here, $\delta=10^{-3}$, $n=32$, while $d$ varies. We see that the curve $\Gamma_{\infty,\delta}$ defined in \ref{eq: gamma.infty} approximates the actual phase transition with increasing accuracy as $d$ grows, as implied by \ref{thm: phase.transition.curve}.
  • Figure 4: We zoom in on the phase diagram (Figure \ref{fig: phase.diag.Id}) for the dynamics on the circle: $d=2$. For $\beta=4, 9$, we also display a trajectory of \ref{SA} for a randomly drawn initial condition at times $t=2.5, 18, 30$. We see that the particles settle at $2$ clusters when $\beta=4$ (bottom right) and $3$ clusters when $\beta=9$ (top right), for a duration of time. This reflects our metastability claim for large $\beta$.
  • Figure 5: Phase diagrams (see Figure \ref{fig: phase.diag.Id} for explanations) for some choices of random matrices $(Q, K, V)$; here $d=128$, $n=32$. Sharp phase transitions as well as metastable regions appear in all cases.
  • ...and 1 more figure
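
The quantities in the captions above (pairwise inner products, and the fraction of pairs with $\langle x_i(t),x_j(t)\rangle\geq 1-\delta$) can be computed with a few lines of NumPy. The sketch below makes several assumptions that are not stated in this excerpt: the dynamics referred to as (SA) are taken to be $\dot{x}_i = \mathsf{P}_{x_i}\big(\sum_j \mathrm{softmax}_j(\beta\langle x_i,x_j\rangle)\,x_j\big)$ with $Q=K=V=I_d$, where $\mathsf{P}_{x}$ projects onto the tangent space of the sphere at $x$; time stepping is explicit Euler followed by renormalization, which is a choice made for this sketch. The function names `sa_step`, `pairwise_inner_products`, and `cluster_fraction` are hypothetical.

```python
import numpy as np

def sa_step(x, beta, dt):
    """One explicit Euler step of the assumed self-attention dynamics on the sphere."""
    logits = beta * (x @ x.T)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability of softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                 # row-wise softmax weights
    v = w @ x                                         # attention-weighted averages
    v -= np.sum(v * x, axis=1, keepdims=True) * x     # project onto tangent space at x_i
    x = x + dt * v
    return x / np.linalg.norm(x, axis=1, keepdims=True)  # return to the unit sphere

def pairwise_inner_products(x):
    """Return <x_i, x_j> for all ordered pairs with i != j."""
    gram = x @ x.T
    off_diagonal = ~np.eye(x.shape[0], dtype=bool)
    return gram[off_diagonal]

def cluster_fraction(x, delta):
    """Fraction of pairs (i != j) whose inner product is at least 1 - delta."""
    return float(np.mean(pairwise_inner_products(x) >= 1.0 - delta))

rng = np.random.default_rng(0)
n, d, beta, dt, steps = 32, 3, 1.0, 0.1, 200

g = rng.standard_normal((n, d))
x0 = g / np.linalg.norm(g, axis=1, keepdims=True)     # uniform initialization on the sphere

x = x0.copy()
for _ in range(steps):
    x = sa_step(x, beta, dt)

inner = pairwise_inner_products(x)
frac = cluster_fraction(x, delta=1e-3)
print(f"mean inner product: {inner.mean():.4f}, clustered fraction: {frac:.4f}")
```

Averaging `cluster_fraction` over many random initializations, for a grid of $(t,\beta)$ values, would give a rough analogue of the phase diagrams described in the captions; a histogram of `inner` corresponds to the kind of plot described for Figure 1, here for the toy dynamics rather than a trained model.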

Theorems & Definitions (40)

  • Remark 2.1: Collective behavior
  • Remark 2.2: Permutation equivariance
  • Remark 3.1: Well-posedness
  • Remark 3.2: Positional encoding
  • Remark 3.3: Mean field limit
  • Proposition 3.4
  • Proof of \ref{prop: existence.uniqueness.energy}
  • Remark 3.5: Doubly stochastic kernel
  • Lemma 3.6
  • Lemma 3.7
  • ...and 30 more