Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport
Elon Litman
TL;DR
The paper gives a principled foundation for scaled-dot-product attention (SDPA) by showing that the forward pass solves a one-sided Entropic Optimal Transport (EOT) problem with cost $C_j = -\langle \mathbf{q}, \mathbf{k}_j \rangle$ and an entropy regularizer of strength $\varepsilon$; setting $\varepsilon$ equal to the softmax temperature $\tau$ recovers the SDPA weights. It then shows that backpropagation implements an advantage-based policy gradient: the gradient with respect to the scores is proportional to the advantage $(u_j - \mathbb{E}_{p^*}[u])$, which gives an RL interpretation with a variance-reducing baseline. The analysis introduces the Log-Sum-Exp potential as the dual representation of the forward problem. The gradient of this potential recovers the attention distribution, and its Hessian is proportional to the Fisher Information Matrix, which ties the geometry of the forward pass to the learning dynamics. The work thus unifies forward inference and backward learning through information geometry, and suggests how changing the regularizer $\Omega(p)$ could yield alternative attention mechanisms (e.g., Sparsemax, $\alpha$-entmax, ALiBi) with different sparsity and bias properties. Overall, this perspective casts SDPA as an optimization-based mechanism in which principled forward inference aligns with geometry-guided learning, with practical implications for designing new attention variants.
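The forward-pass claim can be checked numerically. The sketch below is illustrative rather than the paper's own code: it assumes the one-sided EOT objective has the form $\langle p, C \rangle - \varepsilon H(p)$ over the probability simplex, and it assumes $\tau = \sqrt{d}$ (the usual SDPA scaling). The function names `sdpa_weights` and `eot_objective` are hypothetical. It compares the softmax weights against random points on the simplex and against the Log-Sum-Exp dual value.

```python
import numpy as np

def sdpa_weights(q, K, tau):
    """Softmax attention weights p_j proportional to exp(<q, k_j> / tau)."""
    s = K @ q
    w = np.exp((s - s.max()) / tau)  # subtract the max for numerical stability
    return w / w.sum()

def eot_objective(p, q, K, eps):
    """Assumed objective: expected cost <p, C> minus eps * entropy, C_j = -<q, k_j>."""
    cost = -(K @ q)
    entropy = -np.sum(p * np.log(p))
    return p @ cost - eps * entropy

rng = np.random.default_rng(0)
d, n = 8, 5
q = rng.normal(size=d)            # one query
K = rng.normal(size=(n, d))       # n keys
tau = np.sqrt(d)                  # assumption: standard SDPA temperature

p_star = sdpa_weights(q, K, tau)
best = eot_objective(p_star, q, K, eps=tau)

# No random point on the simplex should achieve a lower objective than p_star.
candidates = rng.dirichlet(np.ones(n), size=1000)
gap = min(eot_objective(p, q, K, eps=tau) - best for p in candidates)

# Dual check: the optimal value should equal -tau * logsumexp(s / tau).
s = K @ q
lse = tau * (np.log(np.sum(np.exp((s - s.max()) / tau))) + s.max() / tau)

print("smallest gap over candidates:", gap)
print("primal optimum:", best, " negative LSE potential:", -lse)
```

Under these assumptions the gap equals $\tau \, \mathrm{KL}(p \,\|\, p^*)$, so it should be non-negative for every candidate, and the primal optimum should match the negative Log-Sum-Exp potential up to floating-point error.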
Abstract
The scaled-dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism where the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.
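The backward-pass claims can be illustrated in the same spirit. This is a minimal sketch under stated assumptions, not the paper's implementation: the loss is taken to be linear in the attention output, $L = \langle g, \sum_j p_j \mathbf{v}_j \rangle$, so that $u_j = \partial L / \partial p_j = \langle g, \mathbf{v}_j \rangle$ plays the role of the per-key utility; the values `V`, the upstream gradient `g`, and $\tau = \sqrt{d}$ are all hypothetical choices. The code compares the advantage form $p_j (u_j - \mathbb{E}_p[u]) / \tau$ with a finite-difference gradient, and checks that the Jacobian of the attention weights (the Hessian of the potential $\tau \, \mathrm{LSE}(s/\tau)$) equals the categorical Fisher matrix $\mathrm{diag}(p) - p p^\top$ divided by $\tau$.

```python
import numpy as np

def softmax(z):
    w = np.exp(z - z.max())
    return w / w.sum()

rng = np.random.default_rng(1)
d, n = 8, 5
tau = np.sqrt(d)               # assumption: standard SDPA temperature
s = rng.normal(size=n)         # scores s_j = <q, k_j>
V = rng.normal(size=(n, d))    # hypothetical value vectors
g = rng.normal(size=d)         # hypothetical upstream gradient dL/do

def loss(scores):
    """Linear loss on the attention output o = sum_j p_j v_j."""
    return g @ (softmax(scores / tau) @ V)

p = softmax(s / tau)
u = V @ g                                  # u_j = dL/dp_j
advantage_grad = p * (u - p @ u) / tau     # advantage form of dL/ds

# Central finite differences of the loss with respect to each score.
h = 1e-6
basis = np.eye(n)
numeric_grad = np.array(
    [(loss(s + h * e) - loss(s - h * e)) / (2 * h) for e in basis]
)

# Fisher information of a categorical distribution with respect to its logits.
fisher = np.diag(p) - np.outer(p, p)

# Jacobian of p with respect to s, i.e. the Hessian of tau * logsumexp(s / tau).
jac_numeric = np.array(
    [(softmax((s + h * e) / tau) - softmax((s - h * e) / tau)) / (2 * h)
     for e in basis]
)

print("max gradient error:", np.abs(advantage_grad - numeric_grad).max())
print("max Hessian error:", np.abs(jac_numeric - fisher / tau).max())
```

In this setup the advantage-form gradient is exactly `fisher @ u / tau`, which is one concrete way to see how the Fisher geometry fixes the shape of the update.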
