Dynamic metastability in the self-attention model

Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet

TL;DR

The paper proves dynamic metastability for a Transformer-inspired self-attention particle system on the unit sphere, showing that under well-separated, multi-cluster initial data the particles stay clustered in k caps for exponentially long times before coalescing to a single cluster. It achieves this via two complementary lenses: a direct ODE-based analysis that bounds escape and intra-cluster dynamics, and an energetic reinterpretation anchored in the Otto-Reznikoff slow-motion framework, identifying a slow manifold for E_β and Polyak-Łojasiewicz-type control near it. The authors extend the results to the mean-field regime and explore dynamics beyond metastability, revealing a staircase energy profile under suitable time rescalings that mirrors saddle-to-saddle behavior observed in neural network training. They also analyze natural initial data families (Gaussian mixtures, uniform points) and discuss implications and open questions for beyond-metastability dynamics and energy-level characterizations. Overall, the work connects geometric, energetic, and mean-field perspectives to understand long-time, multi-cluster metastable behavior in self-attention dynamics and its broader relevance to gradient-flow perspectives in learning systems.

Abstract

We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks.
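
For orientation, the interacting particle system mentioned in the abstract can be written as below. This is a sketch in the form commonly associated with [GLPR23]; the equation itself is not displayed in this summary, so the symbols $\mathsf{P}_x$ (projection onto the tangent space at $x$) and $Z_{\beta,i}$ (softmax normalization) are assumed notation and not a quotation from the paper.

```latex
\dot{x}_i(t) = \mathsf{P}_{x_i(t)}\!\left( \frac{1}{Z_{\beta,i}(t)} \sum_{j=1}^{n} e^{\beta\langle x_i(t),\, x_j(t)\rangle}\, x_j(t) \right),
\qquad
Z_{\beta,i}(t) = \sum_{k=1}^{n} e^{\beta\langle x_i(t),\, x_k(t)\rangle},
\qquad
\mathsf{P}_{x}\, y = y - \langle x, y\rangle\, x .
```

Under this assumed form, each particle moves along the sphere toward a softmax-weighted average of all particles, and larger $\beta$ concentrates the weights on nearby particles.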

Paper Structure

This paper contains 33 sections, 14 theorems, 275 equations, 7 figures.

Key Result

Theorem 1.2

Suppose $d, n\geqslant 2$ and $\beta>1$. Consider $(x_i(0))_{i=1}^n\in(\mathbb{S}^{d-1})^n$ which is $(\beta,\varepsilon)$-separated for some $\varepsilon=\varepsilon(\beta)\in(0,\frac{1}{16})$. Let $(x_i(\cdot))_{i=1}^n \in \mathscr{C}^0(\mathbb{R}_{\geqslant0};(\mathbb{S}^{d-1})^n)$ be the unique (see rem: lambda.gamma for the precise upper bound) and where $\gamma=\gamma(\beta)>0$ and $\alpha

Figures (7)

  • Figure 1: An illustration of a $(\beta,\varepsilon)$-separated configuration on the circle $\mathbb{S}^{1}$. To clearly visualize distances, we not only show the spherical caps $\mathscr{S}_j(\varepsilon)$, but also their convex hull within the unit disk. The case of interest in our framework is that in which caps have an opening $\varepsilon$ that is much smaller than the distance $1-\alpha$ between them.
  • Figure 2: A stylized illustration of Theorem 1.2: here $d=2$, $n=5$ and $\beta=4$, initial points are distributed uniformly at random, and the self-attention dynamics (SA) is solved using a forward Euler scheme with time step equal to $0.1$. Two caps appear and beyond time $T_1\sim 9$ particles within these caps are essentially merged. The dynamics remains in this metastable state at least up to time $T_2\sim 356$, a point beyond which the two merged rightmost particles exit the cap, $\mathscr{S}_1(2\varepsilon)$ say, and Theorem 1.2 is no longer indicative. Continued in Figure 3.
  • Figure 3: Continuing from Figure 2, we see that particles keep converging until they meet at a cluster, which is the global maximum of $\mathsf{E}_\beta$. We recall that in this setup ($d=2$ and $\beta\not\gtrsim n^2$, nor are initial particles in some hemisphere), there is no proof of convergence to a cluster as of yet. A movie of the full evolution can be found at https://github.com/HugoKoubbi/2024-transformers-dotm/blob/main/video[tape]/1.gif.
  • Figure 4: \ref{['thm: staircase']} entails that the energy of a trajectory along the time-scale defined in \ref{['compt: reparam']} converges, uniformly in time, as $\beta\to+\infty$, to a piecewise constant-in time function which equals $1$ (designating the maximal value of $\mathsf{E}_\beta$) beyond some finite time $T_k>0$. Plateaux indicate metastable zones, and jumps in the energy level indicate rapprochement of nearby clusters.
  • Figure 5: An illustration of the landscape of $\mathsf{E}_\beta$. The slow manifold $\mathcal{N}$ is an almost-flat zone, thus one where the gradient flow moves very little, and is surrounded by zones where $\mathsf{E}_\beta$ satisfies a PL inequality.
  • ...and 2 more figures
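
The experiment described in the caption of Figure 2 (forward Euler, $d=2$, $n=5$, $\beta=4$, time step $0.1$) can be approximated with the minimal sketch below. It is not the authors' code. It assumes the softmax-normalized form of the dynamics from [GLPR23], and it renormalizes particles after every step to keep them on the sphere, which the caption does not specify. The function names `sa_velocity`, `simulate` and `min_pairwise_inner` are hypothetical.

```python
import numpy as np


def sa_velocity(x, beta):
    """Tangential velocity of each particle (assumed softmax-normalized model).

    x has shape (n, d), with rows on the unit sphere.
    """
    logits = beta * (x @ x.T)                      # beta * <x_i, x_j>
    logits -= logits.max(axis=1, keepdims=True)    # stabilize the exponentials
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    drift = weights @ x                            # weighted average of particles
    radial = np.sum(drift * x, axis=1, keepdims=True)
    return drift - radial * x                      # project onto tangent space at x_i


def simulate(x0, beta=4.0, dt=0.1, steps=1000):
    """Forward Euler with renormalization after each step (an assumption)."""
    x = x0 / np.linalg.norm(x0, axis=1, keepdims=True)
    for _ in range(steps):
        x = x + dt * sa_velocity(x, beta)
        x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x


def min_pairwise_inner(x):
    """Smallest inner product between particles; equals 1 for a single cluster."""
    return float((x @ x.T).min())


# Five points drawn uniformly at random on the circle, as in the caption.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=5)
x0 = np.stack([np.cos(theta), np.sin(theta)], axis=1)
x_final = simulate(x0, beta=4.0, dt=0.1, steps=4000)
print("min pairwise inner product:", min_pairwise_inner(x_final))
```

Tracking `min_pairwise_inner` over time is one simple way to see plateaus: it stays well below 1 while several clusters persist and approaches 1 once all particles have merged. The seed and the number of steps are arbitrary choices, so the times $T_1$ and $T_2$ quoted in the caption should not be expected to reproduce exactly.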

Theorems & Definitions (49)

  • Definition 1.1: $(\beta,\varepsilon)$-separated configurations
  • Theorem 1.2
  • Remark 1.3: On \ref{['eq: lambda.1']}
  • Remark 1.4: $\Omega(1)$
  • Remark 1.5: Low temperature
  • Remark 1.6: Different heights
  • Remark 1.7: Safety caps
  • Remark 1.8: Time of collapse
  • Proof of Theorem 1.2
  • Lemma 2.1: Until collapse
  • ...and 39 more