HSR-Enhanced Sparse Attention Acceleration

Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

TL;DR

This work targets the bottleneck of attention in long-context LLMs by exploiting sparsity with Half-Space Reporting (HSR). It shows how to identify and compute only the massively activated or nonzero entries for both Softmax and ReLU attention, achieving runtime reductions such as $O(mn^{4/5})$ for generation decoding and $O(mn^{1-1/\lfloor d/2\rfloor}+mn^{4/5})$ for prompt prefilling, with provably negligible Softmax error under mild assumptions. The approach leverages HSR to rapidly report high-impact index sets and provides detailed sparsity and error analyses alongside empirical validation on prominent models. The framework includes two concrete pipelines (generation decoding with fixed keys and prompt prefilling with dynamic keys) and rigorous runtime guarantees, bridging theory and potential practical speedups for long-context transformers. Overall, this work advances efficient long-context processing by marrying geometric data structures with attention sparsity, offering practical implications for latency and throughput in large-scale language models.
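
As a rough sense of scale for the generation-decoding bound above, here is a worked instance. It ignores the constants and lower-order factors hidden by the $O(\cdot)$ notation, which this illustration does not attempt to estimate, and the context length $n = 2^{20}$ is chosen only because it makes the arithmetic clean.

$$
n = 2^{20}: \qquad n^{4/5} = 2^{16} = 65{,}536, \qquad \frac{mn}{mn^{4/5}} = n^{1/5} = 2^{4} = 16 .
$$

Under these simplifications, the bound at roughly one million context tokens is 16 times smaller than the naive $O(mn)$ bound, and the gap grows as $n^{1/5}$.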

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. We introduce a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha\in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to identify non-zero or "massively activated" entries in the attention matrix. We present theoretical analyses for two key scenarios: generation decoding and prompt prefilling. For generation decoding, our approach achieves a running time of $O(mn^{4/5})$, significantly faster than the naive $O(mn)$ approach, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. We can also reduce the running time for prompt prefilling from $O(mn)$ to $O(mn^{1 - 1 / \lfloor d/2\rfloor} + mn^{4/5})$. Our method introduces only provably negligible error for Softmax attention. This work represents a significant step towards enabling efficient long-context processing in LLMs.
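
The idea of computing only the non-zero entries of ReLU attention can be illustrated with a minimal sketch. Several things here are assumptions rather than the paper's definitions, since Definition 1.2 is not reproduced on this page: the score is taken to be $\langle q, k_j\rangle/\sqrt{d} - b$, the weight is $\mathsf{ReLU}^\alpha$ of that score, and weights are normalized by their sum. The helper `report_active` is a hypothetical brute-force stand-in for an HSR query; it returns the keys on the positive side of the half-space but does not achieve the sublinear query time of a real HSR data structure.

```python
import math
import random


def score(q, k, b):
    """Shifted, scaled inner product <q, k> / sqrt(d) - b (assumed form)."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) - b


def report_active(q, K, b):
    """Stand-in for an HSR query: indices j whose score is positive.

    This brute-force scan costs O(nd). It is used only to keep the sketch
    self-contained and is not the data structure analyzed in the paper.
    """
    return [j for j, k in enumerate(K) if score(q, k, b) > 0.0]


def relu_attention(q, K, V, b, alpha, indices=None):
    """ReLU^alpha attention for one query over `indices` (default: all keys)."""
    if indices is None:
        indices = range(len(K))
    weights = [(j, max(score(q, K[j], b), 0.0) ** alpha) for j in indices]
    total = sum(w for _, w in weights)
    if total == 0.0:  # no activated key: return a zero vector by convention
        return [0.0] * len(V[0])
    return [sum(w * V[j][c] for j, w in weights) / total
            for c in range(len(V[0]))]


# Synthetic Gaussian data; all sizes below are illustrative choices.
random.seed(0)
n, d, b, alpha = 512, 8, 1.5, 2
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(d)]

active = report_active(q, K, b)
dense = relu_attention(q, K, V, b, alpha)
sparse = relu_attention(q, K, V, b, alpha, indices=active)
print(f"{len(active)} of {n} keys activated")
print("max |dense - sparse| =", max(abs(x - y) for x, y in zip(dense, sparse)))
```

Because every key outside the reported set has weight exactly zero, restricting the sum to the reported indices changes nothing in the output. This is why the ReLU case needs no error analysis, in contrast to Softmax, where the dropped entries are small but non-zero.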

Paper Structure

This paper contains 35 sections, 25 theorems, 43 equations, 3 figures, 1 table, 3 algorithms.

Key Result

Corollary 3.1

Let ${\cal T}_{\mathsf{init}}$ denote the pre-processing time to build the data structure, ${\cal T}_{\mathsf{query}}$ denote the time per query, and ${\cal T}_{\mathsf{update}}$ denote the time per update. Given a set of $n$ points in $\mathbb{R}^d$, the half-space range reporting problem can be solved with t

Figures (3)

  • Figure 1: The trend of the Softmax activation ($\exp$) and the ReLU activation with different powers. Here, we choose $b = 1.5$ as the threshold for the ReLU activation.
  • Figure 2: An outline of our principal algorithms. Top: Algorithm \ref{alg:relu_attn_gen} for generation decoding, where the key matrix $K$ is fixed. During each inference step, the input query $Q$ interacts with the HSR data structure to obtain the activated index set $\widetilde{S}_{i, j}$. The attention matrix is then calculated from $\widetilde{S}_{i, j}$. Bottom: Algorithm \ref{alg:calculation_general_framework} for prompt prefilling, where both the key matrix $K$ and the query matrix $Q$ vary across iterations. Consequently, the HSR data structure must first be initialized with $K$ and then queried with $Q$. Finally, the attention matrix is calculated from the activated index set $\widetilde{S}_{i, j}$ reported by the HSR data structure. For more information, please refer to Remark \ref{rem:difference_of_alg2_and_alg3}.
  • Figure 3: We evaluated the perplexity of three mainstream language models (LLaMA 3.1 8B Instruct, Mistral Nemo 12B, and Phi 3.5 Mini 3.8B Instruct) using Softmax attention with top-$r$ indices on the PaulGrahamEssays dataset. The results indicate a significant increase in perplexity only when the number of selected entries, $r$, falls below $2^4$. This observation aligns with our earlier findings that the proportion of "massively activated" entries is minimal compared to the total number of entries. Furthermore, the approximation error introduced by using top-$r$ indices in Softmax attention remains negligible unless $r$ becomes excessively small.
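
The top-$r$ Softmax attention described in the Figure 3 caption can be sketched as follows. This is a toy illustration on synthetic Gaussian data, not a reproduction of the perplexity experiment: the sizes `n` and `d`, the query scaling, and the helper names `softmax_attention` and `top_r_indices` are all assumptions introduced here, and the top-$r$ indices are found by a plain sort rather than by an HSR data structure.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_attention(q, K, V, indices=None):
    """Softmax attention for one query, restricted to `indices` (default: all)."""
    idx = list(range(len(K))) if indices is None else list(indices)
    scores = [dot(q, K[j]) / math.sqrt(len(q)) for j in idx]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [sum(e * V[j][c] for e, j in zip(exps, idx)) / z
            for c in range(len(V[0]))]


def top_r_indices(q, K, r):
    """Indices of the r keys with the largest inner product with q."""
    order = sorted(range(len(K)), key=lambda j: dot(q, K[j]), reverse=True)
    return order[:r]


# Synthetic data; the query is scaled up so that a few scores dominate.
random.seed(0)
n, d = 1024, 16
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q = [3.0 * random.gauss(0, 1) for _ in range(d)]

full = softmax_attention(q, K, V)
for r in (4, 16, 64, 256, n):
    approx = softmax_attention(q, K, V, top_r_indices(q, K, r))
    err = max(abs(x - y) for x, y in zip(full, approx))
    print(f"r = {r:4d}  max abs error = {err:.3e}")
```

Unlike the ReLU case, the dropped Softmax entries are non-zero, so the restricted output only approximates the full one. The error is at most twice the dropped probability mass times the largest absolute value entry, which is why it is small whenever a few entries carry most of the mass.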

Theorems & Definitions (46)

  • Definition 1.1: Softmax attention
  • Definition 1.2: ReLU attention
  • Corollary 3.1: HSR data-structure time complexity [AEM92], informal version of Corollary \ref{cor:hsr_running_time}
  • Theorem 4.1: Running time of ReLU attention generation decoding, informal version of Theorem \ref{thm:relu_gen_running_time}
  • Theorem 4.2: Running time of Softmax attention generation decoding, informal version of Theorem \ref{thm:Softmax_attention_generation}
  • Theorem 4.3: Error analysis of Softmax attention with index set, informal version of Theorem \ref{thm:err_analysis_of_Softmax_attn_with_index_set}
  • Remark 4.4
  • Theorem 5.1: Running time of ReLU attention prompt prefilling, informal version of Theorem \ref{thm:relu_cal_running_time}
  • Theorem 5.2: Running time of Softmax attention prompt prefilling, informal version of Theorem \ref{thm:Softmax_attention_computation}
  • Lemma 6.1: Sparsity analysis, informal version of Lemma \ref{lem:sparsity_analysis}
  • ...and 36 more