Dynamic metastability in the self-attention model
Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet
TL;DR
The paper proves dynamic metastability for a Transformer-inspired self-attention particle system on the unit sphere: when the initial data consist of k well-separated clusters, the particles remain confined to k spherical caps for an exponentially long time before eventually coalescing into a single cluster. The proof is given from two complementary viewpoints. The first is a direct ODE analysis that bounds both the escape from each cap and the intra-cluster dynamics. The second is an energetic reinterpretation within the Otto-Reznikoff slow-motion framework, which identifies a slow manifold for the energy E_β and establishes Polyak-Łojasiewicz-type control near it. The authors extend the results to the mean-field regime and study the dynamics beyond metastability, where a suitable time rescaling reveals a staircase energy profile that mirrors the saddle-to-saddle behavior observed in neural network training. They also analyze natural families of initial data (Gaussian mixtures, uniformly distributed points) and pose open questions about the dynamics beyond metastability and the characterization of the energy levels. Taken together, the geometric, energetic, and mean-field perspectives explain the long-lived multi-cluster states of self-attention dynamics and relate them to gradient-flow analyses of learning systems.
Abstract
We consider the self-attention model, an interacting particle system on the unit sphere that serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23]: although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. Finally, we probe the dynamics beyond the exponentially long period of metastability and illustrate that, under an appropriate time rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior reminiscent of recent works on the analysis of training dynamics via gradient descent for two-layer neural networks.
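The behavior described in the abstract can be explored numerically. The following is a minimal sketch, not the authors' code, and it rests on assumptions that the text above does not state: it takes the dynamics to be the unnormalized form dx_i/dt = P_{x_i}((1/n) Σ_j exp(β⟨x_i, x_j⟩) x_j), where P_x projects onto the tangent space of the sphere at x, and the energy to be E_β(X) = (1/(2βn²)) Σ_{i,j} exp(β⟨x_i, x_j⟩). Time stepping is explicit Euler followed by renormalization, a choice made here for simplicity. All function names, parameters, and the clustered initialization are hypothetical.

```python
import numpy as np


def project_tangent(x, v):
    """Remove from each row of v its component along the matching row of x."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x


def velocity(x, beta):
    """Assumed right-hand side: tangential part of (1/n) sum_j exp(beta <x_i, x_j>) x_j."""
    n = x.shape[0]
    weights = np.exp(beta * (x @ x.T))  # (n, n) pairwise attention weights
    return project_tangent(x, weights @ x / n)


def energy(x, beta):
    """Assumed energy E_beta = (1 / (2 beta n^2)) sum_{i,j} exp(beta <x_i, x_j>)."""
    n = x.shape[0]
    return np.exp(beta * (x @ x.T)).sum() / (2.0 * beta * n**2)


def simulate(x0, beta, dt=0.01, steps=2000):
    """Explicit Euler steps, each followed by renormalization onto the sphere."""
    x = x0 / np.linalg.norm(x0, axis=1, keepdims=True)
    energies = [energy(x, beta)]
    for _ in range(steps):
        x = x + dt * velocity(x, beta)
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        energies.append(energy(x, beta))
    return x, np.array(energies)


def clustered_init(centers, per_cluster, spread, rng):
    """Hypothetical initial data: noisy copies of each center, renormalized."""
    pts = np.repeat(np.asarray(centers, dtype=float), per_cluster, axis=0)
    pts = pts + spread * rng.standard_normal(pts.shape)
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two clusters around orthogonal directions on the 2-sphere.
    x0 = clustered_init([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], 5, 0.1, rng)
    x, energies = simulate(x0, beta=4.0, dt=0.005, steps=4000)
    print("initial energy:", energies[0])
    print("final energy:  ", energies[-1])
    print("upper bound exp(beta) / (2 beta):", np.exp(4.0) / 8.0)
```

Under these assumptions the energy is bounded above by exp(β)/(2β), with equality only when all particles coincide, so plotting `energies` against time is one way to look for long plateaus while several clusters persist. Larger β should lengthen such plateaus, but the step size `dt` then has to shrink, since the particle velocity can be as large as exp(β).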
