Replacing Softmax Similarity with a Sharpened Angular Similarity: Theory and Practice of Scaling To Billion-Context Attention
Sahil Joshi, Agniva Chowdhury, Amar Kanakamedala, Ekam Singh, Evan Tu, Anshumali Shrivastava
TL;DR
This work targets the quadratic bottleneck of Softmax Attention in long-context transformers by introducing RACE Attention, a linear-time attention mechanism built on a sharpened angular (cosine) similarity and randomized LSH-based sketches. It provides a principled RandNLA-informed analysis showing how per-head sketch parameters control bias-variance trade-offs, and demonstrates experimentally that RACE Attention matches strong baselines on standard tasks while scaling to tens of millions of tokens on CPU and GPU. Key contributions include the angular kernel formulation, a three-stage linear-time algorithm, a rigorous error bound, and scaling results that go well beyond what state-of-the-art attention implementations can process at extreme context lengths. The practical impact is a viable path to billion-token contexts on commodity hardware, with potential extensions to inference-only use and GPU-accelerated kernels.
Abstract
Softmax Attention has quadratic time complexity, which makes it prohibitively expensive at long contexts, even with highly optimized GPU kernels. For example, FlashAttention (an exact, GPU-optimized implementation of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular (cosine) similarity, and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text classification, RACE Attention matches the accuracy of strong baselines while reducing runtime and memory. In a controlled scale test, it processes up to 12 million tokens during a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU, well beyond the practical limits of current state-of-the-art attention implementations. RACE Attention thus offers a practical, theoretically grounded mechanism for outrageously long context windows on today's hardware. We hope that it gets adopted in practice.
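To illustrate why an angular kernel pairs naturally with LSH, the following is a minimal sketch, not the paper's algorithm. It assumes the sharpened angular similarity takes the form (1 - θ/π)^γ, where θ is the angle between a query and a key and γ is a sharpening exponent; the abstract does not state this form, so treat it as an assumption. It relies on the standard fact that two vectors agree on one random-hyperplane sign bit with probability 1 - θ/π, so agreeing on γ independent bits happens with probability (1 - θ/π)^γ. The sketch uses hard sign hashes for simplicity, whereas the abstract describes soft LSH. The function names `angular_similarity` and `srp_collision_rate` are hypothetical.

```python
import math
import random


def angular_similarity(q, k, gamma):
    """Assumed sharpened angular similarity: (1 - theta/pi) ** gamma."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_q * norm_k)))
    theta = math.acos(cos)
    return (1.0 - theta / math.pi) ** gamma


def srp_collision_rate(q, k, gamma, trials, seed=0):
    """Fraction of trials in which q and k agree on all `gamma` random sign bits."""
    rng = random.Random(seed)
    dim = len(q)
    hits = 0
    for _ in range(trials):
        agree = True
        for _ in range(gamma):
            # One random hyperplane with a Gaussian normal vector.
            w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            sign_q = sum(a * b for a, b in zip(w, q)) >= 0
            sign_k = sum(a * b for a, b in zip(w, k)) >= 0
            if sign_q != sign_k:
                agree = False
                break
        if agree:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    q, k = [1.0, 0.0], [0.0, 1.0]  # orthogonal vectors, theta = pi/2
    print(angular_similarity(q, k, gamma=2))
    print(srp_collision_rate(q, k, gamma=2, trials=20000))
```

For the orthogonal pair above, the closed-form value is 0.5² = 0.25, and the empirical collision rate should land close to it. Larger γ concentrates the similarity on keys at small angles to the query, which is the sense in which the kernel is sharpened.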
