A Unified Perspective on the Dynamics of Deep Transformers

Valérie Castin, Pierre Ablin, José Antonio Carrillo, Gabriel Peyré

TL;DR

The paper develops a unified mean-field framework to study the dynamics of deep Transformer models by casting token evolution as a Vlasov-type PDE for probability measures. It systematically analyzes multiple self-attention variants, proves well-posedness for compactly supported data, and extends to Gaussian initial data where Gaussianity is preserved, yielding tractable ODEs for means and covariances and revealing clustering behavior. A gradient-flow perspective is introduced, linking Transformer dynamics to Wasserstein and Bures-Wasserstein geometries, entropic OT, and a twisted Wasserstein metric, thereby illuminating convergence properties and non-convexity phenomena. The results provide theoretical foundations for understanding clustering and anisotropy evolution in deep transformers and connect their dynamics to well-developed optimal-transport and geometric-analytic frameworks. Collectively, the work offers a rigorous, versatile toolkit for analyzing nonlocal, layerwise interactions in attention-based architectures and suggests directions for designing dynamics with desirable clustering or smoothing properties.
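For orientation, the Softmax case of the Vlasov-type equation mentioned above can be sketched as follows. This uses standard notation for mean-field attention and is not a quotation of the paper's statement: the symbol $\Gamma_\mu$ for the velocity field is an assumed name, and temperature or normalization constants (such as the $\varepsilon$ appearing in Theorem 3.1) are omitted.

```latex
\partial_t \mu_t + \operatorname{div}\big(\mu_t\, \Gamma_{\mu_t}\big) = 0,
\qquad
\Gamma_{\mu}(x) =
\frac{\int e^{\langle Q x,\, K y\rangle}\, V y \,\mathrm{d}\mu(y)}
     {\int e^{\langle Q x,\, K y\rangle}\,\mathrm{d}\mu(y)}.
```

The velocity $\Gamma_\mu(x)$ depends on $\mu$ through both the numerator and the normalizing denominator, which is why the equation is non-linear in the measure.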

Abstract

Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called the Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention, the last by leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
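The interacting particle system referred to in the abstract can be illustrated with a minimal NumPy sketch, assuming plain Softmax self-attention with layer-independent parameters. The function names, the explicit Euler discretization, and the `step` and `n_layers` parameters are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np


def attention_velocity(X, Q, K, V):
    """Softmax self-attention velocity for each token.

    X: (n, d) tokens as rows; Q, K: (k, d); V: (d, d).
    Row i of the result is sum_j W_ij * V x_j, where W_ij is the
    softmax over j of <Q x_i, K x_j>.
    """
    scores = (X @ Q.T) @ (X @ K.T).T             # (n, n) entries <Q x_i, K x_j>
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize exp
    W = np.exp(scores)
    W = W / W.sum(axis=1, keepdims=True)         # each row sums to 1
    return W @ (X @ V.T)                         # (n, d)


def simulate_tokens(X0, Q, K, V, step=0.05, n_layers=100):
    """Explicit Euler steps x_i <- x_i + step * velocity (assumed discretization)."""
    X = np.array(X0, dtype=float)
    for _ in range(n_layers):
        X = X + step * attention_velocity(X, Q, K, V)
    return X


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, n = 2, 2, 50
    X0 = rng.normal(size=(n, d))                 # tokens sampled from a Gaussian
    Q = rng.normal(size=(k, d))
    K = rng.normal(size=(k, d))
    V = np.eye(d)
    X_final = simulate_tokens(X0, Q, K, V, step=0.01, n_layers=20)
    print(np.cov(X_final, rowvar=False))         # empirical covariance of the tokens
```

Sampling the initial tokens from a Gaussian and tracking the empirical covariance gives a rough particle-based view of the covariance dynamics discussed in the abstract; it approximates the mean-field behavior only as the number of tokens grows.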

Paper Structure

This paper contains 55 sections, 40 theorems, 275 equations, 8 figures.

Key Result

Theorem 3.1

Let $d, k\in \mathbb{N}^*$ with $k\le d$ and $p\ge 1$. Let $Q, K\colon [0, +\infty) \to \mathbb{R}^{k\times d}$ and $V \colon [0, +\infty) \to \mathbb{R}^{d\times d}$ be three continuous maps, modeling the evolution of parameters $Q, K, V$ across layers of the Transformer. We set $\varepsilon = 1$ […] with initial condition $\mu_0$ has a unique global weak solution $\mu \in$ […]

Figures (8)

  • Figure 1: Evolution of the covariance matrix of a 2-dimensional Gaussian measure that goes through the Transformer PDE. The plots (a), (b), and (d) were obtained with Softmax self-attention, respectively with (a) $V$ random and $A + A^\top \prec 0$, (b) $V=I_2$ and $A + A^\top \preceq 0$ of rank 1 and (d) $V$ and $A$ chosen specifically to obtain this pattern. The plot (c) corresponds to multi-head self-attention with $V = I_2$ and $A + A^\top \preceq 0$ of rank 1.
  • Figure 1: Comparison of the behavior of Softmax, L2 and multi-head attention in the setting of Figure 1 (evolution of the covariance matrix under the Transformer PDE). All plots correspond to the same parameters, with $V$ random and $A + A^\top \prec 0$. We observe very similar behaviors.
  • Figure 2: Projection on the set of trace-1 matrices of the dynamics of the covariance matrix of a Gaussian measure that goes through the Transformer PDE, in cases where curves blow up or diverge. The plots (a), (b), and (c) were obtained with the same parameters ($V = I_2$ and $Q,K$ fixed so that $A+A^\top \succ 0$), respectively for Softmax, multi-head and L2 self-attention. In (a) and (b), the dynamics explode in finite time, while they are well-posed (but diverging) in (c). Finally, in (d), some of the initializations lead to a finite-time blow-up (purple curves) while others lead to convergence of the covariance matrix (yellow/green curves). (d) was obtained with Softmax self-attention, but we observed very similar behavior with L2 and multi-head self-attention (see the additional experiments in the appendix).
  • Figure 2: Evolution of the covariance matrix of a 2-dimensional Gaussian measure that goes through the L2 Transformer PDE. All plots were obtained with L2 self-attention, with the same parameters as in Figure 1 (a, b, d) for Softmax self-attention. The behavior is extremely similar to that of Softmax self-attention.
  • Figure 3: Histogram of the rank of limiting points of the covariance equation for Softmax self-attention, in dimensions 3, 4, and 5. The matrix $V$ has full rank ($V = I_d$ in the upper row and $V$ random and different for each point in the lower row) and the matrix $A$ has rank $\lfloor d / 2\rfloor$, is random (constrained to be negative symmetric) and different for each point. Limiting points have low rank (smaller than $\lceil d / 2\rceil$), which parallels the clustering phenomenon observed for discrete tokens.
  • ...and 3 more figures
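
The captions above refer to two diagnostics of a covariance matrix: its projection onto trace-1 matrices and the rank of its limiting points. A minimal NumPy sketch of both is given below. It assumes that the trace-1 projection is a rescaling by the trace and that rank is measured with a relative eigenvalue threshold; both choices, and the names `normalize_trace`, `numerical_rank` and `rel_tol`, are assumptions for illustration rather than the paper's procedure.

```python
import numpy as np


def normalize_trace(sigma):
    """Rescale a covariance matrix to unit trace (assumed reading of 'trace-1 projection')."""
    return sigma / np.trace(sigma)


def numerical_rank(sigma, rel_tol=1e-6):
    """Count eigenvalues larger than rel_tol times the largest eigenvalue."""
    eigvals = np.linalg.eigvalsh((sigma + sigma.T) / 2)  # symmetrize before eigvalsh
    return int(np.sum(eigvals > rel_tol * eigvals.max()))


if __name__ == "__main__":
    # Hypothetical near-degenerate covariance: one direction has almost collapsed.
    sigma = np.diag([2.0, 1e-12, 1.0])
    print(np.trace(normalize_trace(sigma)))
    print(numerical_rank(sigma))
```

A relative threshold is used so that the rank estimate does not change when the covariance is rescaled, which keeps it consistent with the trace normalization.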

Theorems & Definitions (71)

  • Theorem 3.1
  • Proof 1
  • Definition 3.2: Conditional Wasserstein distance [hosseini2023conditional]
  • Remark 3.3
  • Theorem 3.4
  • Proof 2
  • Lemma 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition 4.4
  • ...and 61 more