A mathematical perspective on Transformers
Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet
TL;DR
This work develops a mathematical framework for Transformers by viewing self-attention as a mean-field interacting particle system on the unit sphere. It studies the long-time clustering and metastability of tokens using continuity equations, energy monotonicity, and gradient-flow arguments, and it connects the token dynamics to Wasserstein gradient flows and to established models of collective behavior. The results are rigorous and cover small- and large-temperature regimes as well as high-dimensional settings; the paper also outlines extensions toward full Transformer architectures and lists open questions. Working with both the particle and the measure viewpoints, the analysis identifies when and how a single cluster forms, and it exposes metastability and phase transitions that echo practical observations in large language models. This gives a principled basis for further theoretical study of attention-driven dynamics and their implications for token organization and model behavior.
Abstract
Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.
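The particle-system interpretation can be illustrated with a small numerical sketch. It makes several simplifying assumptions that are not stated in the text above: the query, key, and value matrices are all taken to be the identity, each token moves along the softmax-weighted average of all tokens projected onto the tangent space of the sphere, and time is discretized with an explicit Euler step followed by renormalization. The function names and the parameters beta (inverse temperature) and dt (step size) are hypothetical and chosen for illustration only.

```python
import numpy as np


def project_tangent(x, y):
    """Row-wise projection of y onto the tangent space of the sphere at x:
    P_x(y) = y - <x, y> x."""
    return y - np.sum(x * y, axis=1, keepdims=True) * x


def attention_velocity(x, beta):
    """Velocity of each token under simplified self-attention dynamics.

    x has shape (n, d) with unit-norm rows. Assumes identity query, key,
    and value matrices (an illustrative simplification).
    """
    logits = beta * (x @ x.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return project_tangent(x, weights @ x)


def step(x, beta, dt):
    """One explicit Euler step, then renormalize back onto the sphere."""
    x_new = x + dt * attention_velocity(x, beta)
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(n=16, d=3, beta=1.0, dt=0.1, steps=2000, seed=0):
    """Evolve n random tokens on the unit sphere in R^d."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    for _ in range(steps):
        x = step(x, beta, dt)
    return x


if __name__ == "__main__":
    tokens = simulate()
    gram = tokens @ tokens.T
    # Values close to 1 indicate that tokens have moved near a common point.
    print("smallest pairwise inner product:", gram.min())
```

The smallest pairwise inner product serves as a rough clustering diagnostic: it equals 1 exactly when all tokens coincide. How quickly it approaches 1 depends on beta, the dimension, and the initial configuration, so the sketch prints the value rather than asserting a particular outcome.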
