A Mathematical Theory of Attention
James Vuckovic, Aristide Baratin, Remi Tachet des Combes
TL;DR
The paper gives a measure-theoretic formulation of attention that is mathematically equivalent to traditional attention and compatible with the Transformer. By modeling attention as a nonlinear Markov transport on probability measures, it obtains an interacting-particle interpretation of self-attention and a maximum-entropy characterization, then proves, under suitable assumptions, that attention is Lipschitz-continuous in the 1-Wasserstein metric, with quantitative bounds. The framework is used to analyze stability under mis-specified inputs and under increasing depth, including infinitely-deep, weight-sharing transformers, and yields concrete Lipschitz constants for $E = \mathbb{R}^d$ under decaying interaction potentials. Together, these results give a domain-agnostic theoretical account of attention that supports robustness and convergence analyses of deep attention-based architectures.
Abstract
Attention is a powerful component of modern neural networks across a wide variety of domains. However, despite its ubiquity in machine learning, there is a gap in our understanding of attention from a theoretical point of view. We propose a framework to fill this gap by building a mathematically equivalent model of attention using measure theory. With this model, we are able to interpret self-attention as a system of self-interacting particles, we shed light on self-attention from a maximum entropy perspective, and we show that attention is actually Lipschitz-continuous (with an appropriate metric) under suitable assumptions. We then apply these insights to the problem of mis-specified input data; infinitely-deep, weight-sharing self-attention networks; and more general Lipschitz estimates for a specific type of attention studied in concurrent work.
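The measure-theoretic reading can be illustrated with a minimal sketch. This is not the paper's construction, only the standard observation that softmax attention is an expectation of the values under a reweighted, renormalized measure on the keys. The helper names (`measure_attention`, `key_fn`, `value_fn`), the toy tokens, and the identity query/key/value maps are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_attention(queries, keys, values):
    """Standard dot-product attention: one output row per query."""
    out = []
    for q in queries:
        w = softmax([dot(q, k) for k in keys])
        dim = len(values[0])
        out.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(dim)])
    return out


def measure_attention(q, support, masses, key_fn, value_fn):
    """Expectation of value_fn under the measure obtained by reweighting
    `masses` with exp(<q, key_fn(x)>) and renormalizing to total mass 1."""
    logits = [dot(q, key_fn(x)) for x in support]
    m = max(logits)
    tilted = [p * math.exp(l - m) for p, l in zip(masses, logits)]
    z = sum(tilted)
    vals = [value_fn(x) for x in support]
    dim = len(vals[0])
    return [sum(t / z * v[c] for t, v in zip(tilted, vals)) for c in range(dim)]


# Toy input: three tokens in R^2 with identity query/key/value maps (assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = lambda x: x
uniform = [1.0 / len(tokens)] * len(tokens)

# Self-attention as "particles" interacting through their empirical measure:
# each token is updated using the uniform measure on all tokens.
classic = softmax_attention(tokens, tokens, tokens)
as_measure = [measure_attention(x, tokens, uniform, identity, identity) for x in tokens]
```

With uniform masses the two computations agree up to floating-point rounding. The measure form also makes clear that only the distribution of tokens matters: a token repeated twice behaves like a single support point carrying twice the mass.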
