A Mathematical Theory of Attention

James Vuckovic; Aristide Baratin; Remi Tachet des Combes

TL;DR

The paper gives a measure-theoretic formulation of attention that is mathematically equivalent to standard attention and compatible with the Transformer. Modeling attention as a nonlinear Markov transport on probability measures yields an interacting-particle interpretation and a maximum-entropy characterization. Under suitable assumptions, the authors prove that attention is Lipschitz continuous in the 1-Wasserstein metric, with quantitative bounds. The framework supports rigorous stability analysis under mis-specified inputs and under increasing depth, including infinitely-deep, weight-sharing transformers, and gives concrete Lipschitz constants for $E = \mathbb{R}^d$ under decaying interaction potentials. Together, these results help close the theoretical gap around attention, apply across domains, and support robustness and convergence analyses of deep attention-based architectures.

Abstract

Attention is a powerful component of modern neural networks across a wide variety of domains. However, despite its ubiquity in machine learning, there is a gap in our understanding of attention from a theoretical point of view. We propose a framework to fill this gap by building a mathematically equivalent model of attention using measure theory. With this model, we are able to interpret self-attention as a system of self-interacting particles, we shed light on self-attention from a maximum entropy perspective, and we show that attention is actually Lipschitz-continuous (with an appropriate metric) under suitable assumptions. We then apply these insights to the problem of mis-specified input data; infinitely-deep, weight-sharing self-attention networks; and more general Lipschitz estimates for a specific type of attention studied in concurrent work.
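
To make the particle picture concrete, the following is a minimal sketch, not the paper's construction. It assumes identity query, key, and value maps and a plain dot-product interaction, and the names `attention_kernel` and `self_attention_step` are hypothetical. Each token is treated as a particle; the softmax weights form a row-stochastic matrix (a Markov kernel) that depends on the particle configuration itself, which is the sense in which the transport is nonlinear.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_kernel(particles):
    """Row-stochastic matrix P[i][j] = softmax_j(<x_i, x_j>).

    Assumption: identity query/key maps, dot-product interaction.
    """
    return [softmax([dot(x_i, x_j) for x_j in particles]) for x_i in particles]


def self_attention_step(particles):
    """Move each particle to the kernel-weighted average of all particles.

    Assumption: identity value map, so outputs are convex combinations.
    """
    kernel = attention_kernel(particles)
    n, dim = len(particles), len(particles[0])
    return [
        [sum(row[j] * particles[j][k] for j in range(n)) for k in range(dim)]
        for row in kernel
    ]


# Three particles in R^2; the kernel is computed from the particles themselves.
particles = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
kernel = attention_kernel(particles)
updated = self_attention_step(particles)
```

Every row of `kernel` sums to one, so each updated particle is a weighted average of the current configuration. Because the kernel is recomputed from the particles at every step, stacking such steps gives a self-interacting system rather than a fixed linear map.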
