LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri

TL;DR

LOTFormer tackles the quadratic complexity of standard attention by framing attention as a transport problem and introducing a learnable pivot measure supported on $r$ points to enforce a low-rank, doubly stochastic coupling. By solving two entropic OT problems (queries to pivot and pivot to keys) and composing them into a glued coupling, it runs in $O(n d_k r)$ time without forming the full $n \times n$ attention matrix, while remaining end-to-end trainable. Empirically, LOTFormer delivers competitive or superior results on ImageNet-1K across multiple backbones, matches or surpasses state-of-the-art linear and doubly stochastic (DS) baselines on Long Range Arena, and can be plugged into pretrained checkpoints for text benchmarks with modest tuning. This approach offers a practical path to robust, scalable attention that improves information flow and interpretability through a structured, pivot-mediated transport view.
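
For reference, the glued coupling can be written out explicitly. The following is the standard gluing construction under assumed notation, which may differ from the paper's own symbols: $\mu, \nu \in \mathbb{R}^n$ are the query and key measures, $\sigma \in \mathbb{R}^r$ holds the (strictly positive) pivot masses, and $\Gamma^{(1)} \in \mathbb{R}^{n \times r}$, $\Gamma^{(2)} \in \mathbb{R}^{r \times n}$ are the queries-to-pivot and pivot-to-keys plans, so that $\Gamma^{(1)} \mathbf{1}_r = \mu$, $\Gamma^{(1)\top} \mathbf{1}_n = \sigma$, $\Gamma^{(2)} \mathbf{1}_n = \sigma$, and $\Gamma^{(2)\top} \mathbf{1}_r = \nu$.

$$
\Gamma = \Gamma^{(1)} \operatorname{diag}(\sigma)^{-1} \Gamma^{(2)}, \qquad
\Gamma \mathbf{1}_n = \Gamma^{(1)} \operatorname{diag}(\sigma)^{-1} \sigma = \Gamma^{(1)} \mathbf{1}_r = \mu, \qquad
\Gamma^{\top} \mathbf{1}_n = \Gamma^{(2)\top} \operatorname{diag}(\sigma)^{-1} \sigma = \Gamma^{(2)\top} \mathbf{1}_r = \nu .
$$

Because $\Gamma$ factors through an $r$-dimensional space, its rank is at most $r$. Applying it to a value matrix $V$ as $\Gamma^{(1)} \big( \operatorname{diag}(\sigma)^{-1} ( \Gamma^{(2)} V ) \big)$ needs only products with $n \times r$ and $r \times n$ factors, so the $n \times n$ matrix is never formed.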

Abstract

Transformers have proven highly effective across modalities, but standard softmax attention scales quadratically with sequence length, limiting long-context modeling. Linear attention mitigates this by approximating attention with kernel feature maps, yet most attention mechanisms remain row-normalized and can over-concentrate mass on a few tokens, harming robustness and information flow. Doubly stochastic attention counteracts this by balancing token participation across both rows and columns, but existing approaches often add significant overhead. We propose LOTFormer, a linear-time doubly stochastic attention mechanism derived from an optimal transport view of attention as a coupling between query and key measures. LOTFormer enforces a low-rank transport plan by conditioning on a learnable pivot measure with small support. We solve two entropic transport problems, queries to pivot and pivot to keys, and compose them into a conditional coupling that is provably doubly stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ matrix. The pivot locations and masses are learned end-to-end. Across vision and text benchmarks, LOTFormer delivers strong accuracy-efficiency tradeoffs when plugged into standard backbones including Swin, DeiT, and BERT.
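
The two-step procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `sinkhorn` and `lot_attention`, the negative scaled dot-product cost, the uniform query, key, and pivot masses, and the fixed (not learned) pivots are all assumptions made for illustration. The sketch also assumes equal numbers of queries and keys, so that scaling the coupling by $n$ gives unit row and column sums.

```python
import numpy as np


def sinkhorn(cost, a, b, eps=1.0, n_iters=200):
    """Entropic OT plan between histograms a (rows) and b (columns)."""
    kernel = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # match row marginals
        v = b / (kernel.T @ u)    # match column marginals
    return u[:, None] * kernel * v[None, :]


def lot_attention(Q, K, V, pivots, pivot_mass, eps=1.0):
    """Apply a pivot-mediated (glued) coupling to V without an n x n matrix."""
    n, d = Q.shape
    mu = np.full(n, 1.0 / n)                    # assumed uniform query masses
    nu = np.full(K.shape[0], 1.0 / K.shape[0])  # assumed uniform key masses
    # Assumed cost: negative scaled dot product (high similarity = low cost).
    plan_qz = sinkhorn(-(Q @ pivots.T) / np.sqrt(d), mu, pivot_mass, eps)  # n x r
    plan_zk = sinkhorn(-(pivots @ K.T) / np.sqrt(d), pivot_mass, nu, eps)  # r x n
    # Glue the plans through the pivot; only n x r and r x n factors are used.
    out = n * (plan_qz @ ((plan_zk @ V) / pivot_mass[:, None]))
    return out, plan_qz, plan_zk


# Small demo with random inputs and r = 4 fixed pivots.
rng = np.random.default_rng(0)
n, d, r = 16, 8, 4
Q, K, V = (rng.normal(scale=0.5, size=(n, d)) for _ in range(3))
pivots = rng.normal(scale=0.5, size=(r, d))
pivot_mass = np.full(r, 1.0 / r)
out, plan_qz, plan_zk = lot_attention(Q, K, V, pivots, pivot_mass)

# The dense matrix is built here only to inspect its properties.
A = n * plan_qz @ np.diag(1.0 / pivot_mass) @ plan_zk
```

In this sketch, `A` has row and column sums close to 1 once Sinkhorn has converged, and its rank is at most `r`. The output `out` equals `A @ V` but is computed from the two thin factors alone.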

Paper Structure

This paper contains 55 sections, 1 theorem, 30 equations, 3 figures, 10 tables.

Key Result

Lemma 2.1

In the finite discrete measure setting, we have

Figures (3)

  • Figure 1: Illustration of LOTFormer. (a) Queries (red circles), keys (green squares), and the learnable pivot measure (black triangles). (b) Factorization of the transport plan: instead of solving directly for a full $n\times n$ coupling between queries and keys, LOTFormer first computes transport from queries$\to$pivot and pivot$\to$keys, then composes them into a glued coupling. (c–d) Effective query–key couplings induced by the pivot measure for different pivot sizes ($r=2,3,4,5,10$). Top row of each block shows the mediated connections via pivots, and bottom row shows the resulting query–key couplings. (c) Without entropic regularization ($\varepsilon=0$), couplings are sharp and sparse. (d) With entropic regularization ($\varepsilon=1$), couplings become smoother and more diffuse.
  • Figure 2: Patch-level visualizations of [CLS] attention. (Left) Comparison of standard Softmax attention with LOTFormer at different pivot sizes $r\!\in\!\{4,8,16,32\}$, showing how larger $r$ produces sharper, more object-centric maps. (Right) Effect of different [CLS] treatments (all without DWC): enforcing DS on [CLS] (full DS) degrades global aggregation, whereas decoupling it via [CLS]–softmax restores broad coverage, and adding polarization (+Pola) further sharpens selectivity. The leftmost column preserves the original image for context, while the other columns use a neutral gray background to standardize contrast.
  • Figure 3: Runtime scaling with sequence length. Forward-pass runtime (ms/iteration) vs. $N$ on a log–log scale for quadratic methods (left) and linear methods (right). LOTFormer is shown at ranks $r\!\in\!\{16,32,64,128,256\}$ in both panels (solid for $r{=}64$, dashed otherwise). Points denote measured values at $N\!\in\!\{2^{9}, 2^{10}, \dots, 2^{17}\}$.

Theorems & Definitions (2)

  • Lemma 2.1
  • proof