Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport

Elon Litman

TL;DR

The paper provides a principled foundation for scaled-dot-product attention by showing that the forward pass solves a one-sided Entropic Optimal Transport problem with cost $C_j = -\langle \mathbf{q}, \mathbf{k}_j \rangle$ and entropy regularization, yielding the SDPA weights when $\varepsilon = \tau$. It then demonstrates that backpropagation implements an advantage-based policy gradient, where the gradient with respect to the scores is proportional to the advantage $(u_j - \mathbb{E}_{p^*}[u])$, giving an RL interpretation with a variance-reducing baseline. The analysis introduces the Log-Sum-Exp potential as the dual representation of the forward problem, whose gradient recovers the attention distribution, and shows that the Hessian of this potential is proportional to the Fisher Information Matrix, tying the forward geometry to the learning dynamics. The work unifies forward inference and backward learning through information geometry, and suggests how changing the regularizer $\Omega(p)$ could yield alternative attention mechanisms (e.g., Sparsemax, $\alpha$-entmax, ALiBi) with different sparsity and bias properties. Overall, this perspective casts SDPA as an optimization-based mechanism in which principled forward inference aligns with geometry-guided learning, with practical implications for designing new attention variants.

Abstract

The scaled-dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism where the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.
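
The backward-pass claim can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes a single query, a loss that is linear in the attention output (so the upstream gradient `upstream` is a fixed vector), and defines the per-key utility as $u_j = \langle \texttt{upstream}, \mathbf{v}_j \rangle$. All function and variable names are illustrative. Under these assumptions the gradient with respect to the pre-softmax scores should equal $p_j (u_j - \mathbb{E}_{p}[u])$, which the sketch compares against central finite differences.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def loss(logits, values, upstream):
    """Linear loss <upstream, o> with o = sum_j p_j v_j (an assumption of this sketch)."""
    p = softmax(logits)
    out = [sum(pj * v[i] for pj, v in zip(p, values)) for i in range(len(upstream))]
    return sum(g * o for g, o in zip(upstream, out))


def advantage_gradient(logits, values, upstream):
    """Closed form: dL/ds_j = p_j * (u_j - E_p[u]), with u_j = <upstream, v_j>."""
    p = softmax(logits)
    u = [sum(g * vi for g, vi in zip(upstream, v)) for v in values]
    baseline = sum(pj * uj for pj, uj in zip(p, u))  # expected utility under p
    return [pj * (uj - baseline) for pj, uj in zip(p, u)]


def finite_difference_gradient(logits, values, upstream, h=1e-6):
    """Central finite differences of the loss with respect to each score."""
    grads = []
    for j in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[j] += h
        dn[j] -= h
        grads.append((loss(up, values, upstream) - loss(dn, values, upstream)) / (2 * h))
    return grads


rng = random.Random(0)
n_keys, d_v = 5, 3
logits = [rng.gauss(0, 1) for _ in range(n_keys)]      # scaled scores s_j
values = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(n_keys)]
upstream = [rng.gauss(0, 1) for _ in range(d_v)]       # dL/do, held fixed

analytic = advantage_gradient(logits, values, upstream)
numeric = finite_difference_gradient(logits, values, upstream)
max_error = max(abs(a - b) for a, b in zip(analytic, numeric))
print("max |analytic - numeric| =", max_error)
print("sum of score gradients   =", sum(analytic))
```

The score gradients sum to zero because the baseline $\mathbb{E}_{p}[u]$ is subtracted from every utility, which is the variance-reduction structure the abstract refers to. If the scores are taken to be the raw dot products rather than the scaled ones, an additional factor of $1/\tau$ appears.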

Paper Structure

This paper contains 29 sections, 14 theorems, 115 equations, 1 table.

Key Result

theorem 3.3

For any finite query $\bm{q}$ and keys $\{\bm{k}_j\}$, if the EOT regularization parameter is set to the attention temperature, $\varepsilon = \tau$, the optimization problem in Definition 3.2 has a unique solution $\bm{p}^\star$, which is identical to the scaled-dot-product attention ...
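
A small numerical check can make this statement concrete. The sketch below is illustrative only and rests on assumptions not spelled out on this page: it reads the one-sided EOT objective as $\sum_j p_j C_j - \varepsilon H(\bm{p})$ over the probability simplex, with $C_j = -\langle \bm{q}, \bm{k}_j \rangle$ and Shannon entropy $H$, and it uses the common choice $\tau = \sqrt{d_k}$. It verifies that the softmax weights attain the value $-\tau \log \sum_j \exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)$ (the negated Log-Sum-Exp potential) and that randomly sampled distributions never do better.

```python
import math
import random


def softmax(scores, tau):
    """Softmax of scores / tau, computed stably."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def eot_objective(p, costs, eps):
    """Assumed objective: sum_j p_j C_j - eps * H(p), with 0 * log 0 treated as 0."""
    transport = sum(pj * cj for pj, cj in zip(p, costs))
    entropy = -sum(pj * math.log(pj) for pj in p if pj > 0.0)
    return transport - eps * entropy


def random_simplex_point(n, rng):
    """Sample a distribution by normalizing exponential draws."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(0)
d_k, n_keys = 4, 5
q = [rng.gauss(0, 1) for _ in range(d_k)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

tau = math.sqrt(d_k)                                  # assumed temperature
dots = [sum(a * b for a, b in zip(q, k)) for k in keys]
costs = [-d for d in dots]                            # C_j = -<q, k_j>

p_star = softmax(dots, tau)                           # SDPA weights
best = eot_objective(p_star, costs, eps=tau)          # objective at p_star

# Closed-form optimal value: minus tau times log-sum-exp of the scaled scores.
m = max(dots)
lse_dual = -(m + tau * math.log(sum(math.exp((d - m) / tau) for d in dots)))

# Smallest objective among random feasible points, minus the value at p_star.
gap = min(
    eot_objective(random_simplex_point(n_keys, rng), costs, tau)
    for _ in range(2000)
) - best

print("objective at softmax weights:", best)
print("negated LSE potential       :", lse_dual)
print("best random point minus best:", gap)
```

Random sampling is only a sanity check and not a proof of uniqueness; uniqueness follows from strict convexity of the objective in $\bm{p}$ when $\varepsilon > 0$.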

Theorems & Definitions (48)

  • definition 3.1: Scaled-Dot-Product Attention (SDPA)
  • definition 3.2: One-Sided Entropic Optimal Transport Problem
  • theorem 3.3: Attention as the Unique EOT Solution
  • proof
  • remark 3.4: A Fresh View of $\tau$
  • remark 3.5: The Source Measure & Valid Couplings of SDPA
  • proposition 4.1: Softmax from Shannon Entropy
  • proposition 4.2: Sparsemax from the L2 Norm
  • proof
  • proposition 4.3: $\alpha$-entmax from Tsallis Entropy
  • ...and 38 more