A Mathematical Theory of Attention

James Vuckovic, Aristide Baratin, Remi Tachet des Combes

TL;DR

The paper provides a principled, measure-theoretic formulation of attention that is mathematically equivalent to traditional attention and compatible with the Transformer. By modeling attention as a nonlinear Markov transport on probability measures, it reveals an interacting-particle interpretation and a maximum-entropy characterization, then proves Lipschitz continuity in the 1-Wasserstein metric with quantitative bounds. The framework enables rigorous analysis of stability under mis-specified inputs and depth, including infinitely-deep, weight-sharing transformers, and yields concrete results for Lipschitz constants in $E=\mathbb{R}^d$ under decaying interaction potentials. These insights offer a general, gap-closing theory for attention that applies across domains and supports robustness and convergence analyses for deep attention-based architectures.

Abstract

Attention is a powerful component of modern neural networks across a wide variety of domains. However, despite its ubiquity in machine learning, there is a gap in our understanding of attention from a theoretical point of view. We propose a framework to fill this gap by building a mathematically equivalent model of attention using measure theory. With this model, we are able to interpret self-attention as a system of self-interacting particles, we shed light on self-attention from a maximum entropy perspective, and we show that attention is actually Lipschitz-continuous (with an appropriate metric) under suitable assumptions. We then apply these insights to the problem of mis-specified input data; infinitely-deep, weight-sharing self-attention networks; and more general Lipschitz estimates for a specific type of attention studied in concurrent work.

Paper Structure

This paper contains 31 sections, 35 theorems, 113 equations.

Key Result

Proposition 9

Let $G(x,y)=\exp(a(x,y))$, $L(k,\mathrm{d}v)=\sum_{i=1}^{N}\mathbb{1}_{\{k=k_i\}}\,\delta_{v_i}(\mathrm{d}v)$, and let $Q$, $K$, $V$ be as in the definition of attention. Then, using the left action of kernels on measures, the mapping: implements attention as in Definition 1.
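
To make the idea behind the proposition concrete, here is a minimal numerical sketch, not the paper's own construction. It assumes a single query, the scaled dot-product score $a(q,k)=\langle q,k\rangle/\sqrt{d}$, a uniform empirical measure on the keys, and $Q$, $K$, $V$ already applied (treated as identity maps). The helper names `boltzmann_gibbs` and `attention_as_measure` are hypothetical. The sketch reweights the key measure by $G=\exp(a)$ and renormalizes (the standard Boltzmann-Gibbs form $\Psi_G(\mu)(\mathrm{d}x)=G(x)\mu(\mathrm{d}x)/\mu(G)$), then sends each key $k_i$ to a point mass at $v_i$ and takes the mean.

```python
import numpy as np


def boltzmann_gibbs(mu, potential):
    """Reweight a discrete probability vector by a positive potential, then renormalize."""
    reweighted = mu * potential
    return reweighted / reweighted.sum()


def attention_as_measure(q, keys, values):
    """Single-query attention written as reweighting plus lookup (illustrative sketch)."""
    n, d = keys.shape
    mu = np.full(n, 1.0 / n)       # assumption: uniform empirical measure on the keys
    a = keys @ q / np.sqrt(d)      # assumption: a(q, k) = <q, k> / sqrt(d)
    G = np.exp(a - a.max())        # the constant shift cancels in the normalization
    pi = boltzmann_gibbs(mu, G)    # probability weights over the keys
    # Lookup step: key k_i carries a point mass at v_i, so the measure on values
    # is sum_i pi_i * delta_{v_i}; its mean is returned as the attention output.
    return pi @ values


def softmax_attention(q, keys, values):
    """Reference: standard scaled dot-product attention for one query."""
    scores = keys @ q / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values


rng = np.random.default_rng(0)
q = rng.normal(size=4)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 3))
print(np.allclose(attention_as_measure(q, keys, values),
                  softmax_attention(q, keys, values)))
```

Under these assumptions the two functions agree up to floating-point error, because the uniform weights $1/N$ cancel in the normalization and leave exactly the softmax weights.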

Theorems & Definitions (75)

  • Definition 1: Attention (Bahdanau et al., 2014)
  • Definition 2: Markov kernel
  • Definition 3: Boltzmann-Gibbs Transformation
  • Definition 4: Softmatch Kernel
  • Definition 5: Lookup Kernel
  • Definition 6: Moment Encoding
  • Definition 7: Moment Subspace of $\mathcal{P}(E)$
  • Definition 8: Attention Kernel
  • Proposition 9
  • proof
  • ...and 65 more