Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention

Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava

TL;DR

This work targets the quadratic bottleneck of Softmax Attention in long-context transformers by introducing RACE Attention, a linear-time attention mechanism built on a sharpened angular (cosine) similarity and randomized LSH-based sketches. It provides a RandNLA-informed analysis showing how per-head sketch parameters control the bias-variance trade-off, and its experiments show that RACE Attention matches strong baselines on standard tasks while scaling to tens of millions of tokens on CPU and GPU. Key contributions include the angular kernel formulation, a three-stage linear-time algorithm, a rigorous error bound, and scaling results in which RACE Attention handles context lengths beyond the reach of state-of-the-art attention implementations. The practical impact is a viable path to billion-token contexts on commodity hardware, with potential extensions to inference-only use and GPU-accelerated kernels.

Abstract

Softmax Attention has a quadratic time complexity, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of the current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today's hardware. We hope that it gets adopted in practice.
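
To make the "sharpened angular similarity" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the kernel takes the form $(1 - \theta/\pi)^{\gamma}$, where $\theta$ is the angle between a query and a key; this is the collision probability of sign-random-projection hashing raised to a sharpening power $\gamma$, and the paper's exact definition may differ. The function names `angular_similarity` and `angular_attention_weights` are hypothetical.

```python
import math

def angular_similarity(q, k, gamma=1.0):
    """Assumed kernel form: (1 - theta/pi) ** gamma, theta = angle(q, k)."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_q * norm_k)))
    theta = math.acos(cos)
    return (1.0 - theta / math.pi) ** gamma

def angular_attention_weights(q, keys, gamma):
    """Exact (quadratic-cost) normalized weights under the assumed kernel."""
    scores = [angular_similarity(q, k, gamma) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]

if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    for gamma in (1, 8):
        print(gamma, angular_attention_weights(query, keys, gamma))
```

Raising $\gamma$ concentrates the weight on the keys closest in angle to the query, which is the qualitative behavior the abstract attributes to replacing the exponential kernel. This sketch still costs time quadratic in sequence length; the linear-time behavior comes from the LSH-based approximation, not from the kernel alone.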

Paper Structure

This paper contains 21 sections, 13 theorems, 87 equations, 7 figures, 13 tables, 2 algorithms.

Key Result

Lemma 1

Given a dataset $D$, an LSH family $H$ with finite range $[1, R]$ and a parameter $p$, construct an LSH function $h(x) \to [1, R^p]$ by concatenating $p$ independent hashes from $H$. Let $A$ be an ACE array constructed using $h(x)$. Then for any query $q$,

Figures (7)

  • Figure 1: This figure contrasts the linear complexity of RACE Attention with the quadratic complexity of the Softmax Attention mechanism. Specifically, we highlight how the final representation $o_5$ is computed under Softmax versus RACE. In Softmax, the entire fifth column of the attention score matrix is required. In contrast, RACE does not require the full matrix; instead, it aggregates statistics within LSH-mapped buckets, utilizing the appropriate collision probability $\alpha$ to compute $o_5$.
  • Figure 2: Comparison of Softmax and Angular kernels at different sharpening levels $\gamma$. As $\gamma$ (or non-linearity) increases, Angular transitions from flat similarity scores to a sharper distribution, recovering behavior similar to the exponential in the Softmax.
  • Figure 3: A rigorous scaling stress-test across hardware. The top row shows GPU scaling results; the bottom row shows CPU scaling results. We run a single forward-backward pass configured with 1 batch, 4 heads, and an embedding dimension of 128. Linformer and Performer use the same low-rank/feature dimension as in Table \ref{tab:imdb}.
  • Figure 4: A rigorous scaling stress-test (including FlashAttention) across hardware. Plots (a)–(b) use logarithmic axes. RACE is evaluated with $(P{=}2,L{=}2,M{=}1)$ throughout; Linformer and Performer use the same low-rank/feature dimension as in Table \ref{tab:imdb}.
  • Figure 5: A rigorous scaling test for algorithmic comparison between FlashAttention on GPU and RACE Attention on CPU.
  • ...and 2 more figures
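
The bucket aggregation described in the Figure 1 caption can be illustrated with a simplified sketch. This is not the paper's algorithm: it uses hard sign-random-projection hashing instead of soft LSH, it is non-causal, and the names `make_tables`, `bucket_id`, `bucketed_attention`, `num_tables`, and `num_planes` are hypothetical. It only shows why the cost is linear in sequence length: every key is folded into per-bucket sums once, and every query reads one bucket per table instead of scoring all keys.

```python
import random

def make_tables(num_tables, num_planes, dim, seed=0):
    """Draw Gaussian hyperplanes: num_tables independent hash tables."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_planes)]
            for _ in range(num_tables)]

def bucket_id(planes, x):
    """Concatenate the sign bits of x against each hyperplane."""
    bucket = 0
    for w in planes:
        bit = 1 if sum(a * b for a, b in zip(w, x)) >= 0.0 else 0
        bucket = (bucket << 1) | bit
    return bucket

def bucketed_attention(queries, keys, values,
                       num_tables=8, num_planes=2, seed=0):
    """Hard-LSH stand-in for attention: average values of colliding keys."""
    dim, d_v = len(keys[0]), len(values[0])
    tables = make_tables(num_tables, num_planes, dim, seed)
    numer = [[0.0] * d_v for _ in queries]
    denom = [0.0] * len(queries)
    for planes in tables:
        num_buckets = 1 << num_planes
        value_sums = [[0.0] * d_v for _ in range(num_buckets)]
        counts = [0] * num_buckets
        # One pass over keys: accumulate per-bucket statistics.
        for k, v in zip(keys, values):
            b = bucket_id(planes, k)
            counts[b] += 1
            for j in range(d_v):
                value_sums[b][j] += v[j]
        # One pass over queries: read only the query's own bucket.
        for i, q in enumerate(queries):
            b = bucket_id(planes, q)
            denom[i] += counts[b]
            for j in range(d_v):
                numer[i][j] += value_sums[b][j]
    # Queries that never collide with any key get a zero output.
    return [[n / d if d > 0 else 0.0 for n in row]
            for row, d in zip(numer, denom)]

if __name__ == "__main__":
    keys = [[1.0, 0.0], [-1.0, 0.0]]
    values = [[1.0], [0.0]]
    print(bucketed_attention([[1.0, 0.0]], keys, values))
```

Averaged over tables, the count in a query's bucket estimates the sum of collision probabilities between that query and all keys, so the ratio of summed values to summed counts behaves like a kernel-weighted average. Concatenating more sign bits per table (`num_planes`) raises the collision probability to a higher power and therefore sharpens the implied kernel, while adding tables reduces variance.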

Theorems & Definitions (13)

  • Lemma 1: Theorem 1 of ColemanS20-RACE-KDE
  • Theorem 2
  • Lemma 3: Bounds for a single ensemble
  • Lemma 4
  • Lemma 5: Matrix Bernstein
  • Theorem 6: Kernel deviation with explicit constants
  • Lemma 7
  • Lemma 8: Bounding the bias term
  • Lemma 9: Row-sum and inverse diagonal control
  • Lemma 10: Concentration bound for $E$
  • ...and 3 more