SparQ Attention: Bandwidth-Efficient LLM Inference

Luka Ribar; Ivan Chelombiev; Luke Hudlass-Galley; Charlie Blake; Carlo Luschi; Douglas Orr

TL;DR

SparQ Attention targets the memory bandwidth bottleneck in autoregressive LLM inference by selectively fetching only the most relevant KV cache entries. It combines top-$k$ attention selection with mean-value reallocation and a query-sparsity heuristic, enabling substantial data-transfer reductions (up to $8\times$) while preserving task performance across diverse models. The method supports grouped-query attention, shows strong scalability to long sequences, and is validated through microbenchmarks and end-to-end tests on CPU and GPU, demonstrating practical inference speedups. Overall, SparQ offers a hardware-aware, plug-in improvement for pre-trained LLMs that maintains full context information and generalizes across model families and tasks, reducing latency in real-world deployments.
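The three ingredients named above (query sparsity, top-$k$ selection, mean-value reallocation) can be illustrated for a single query vector with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' reference implementation: the function names, the exact form of the temperature correction, and the parameters `r` (number of query components kept) and `k` (number of cache positions fetched) are assumptions made here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dense_attention(q, K, V):
    """Reference single-query attention over the full cache."""
    return softmax(q @ K.T / np.sqrt(q.shape[0])) @ V

def sparq_attention(q, K, V, r, k):
    """Approximate single-query attention; q: (d,), K and V: (S, d)."""
    d = q.shape[0]
    # Step 1 (query sparsity): approximate the scores using only the r
    # largest-magnitude components of q, so only r columns of K are read.
    i1 = np.argsort(-np.abs(q))[:r]
    # Assumed temperature correction: rescale by the share of |q| mass kept.
    scale = np.sqrt(d * np.abs(q[i1]).sum() / np.abs(q).sum())
    s_hat = softmax(q[i1] @ K[:, i1].T / scale)
    # Step 2 (top-k): fetch full K and V rows only for the k best positions.
    i2 = np.argsort(-s_hat)[:k]
    s = softmax(q @ K[i2].T / np.sqrt(d))
    y_top = s @ V[i2]
    # Step 3 (mean-value reallocation): give the estimated leftover score
    # mass (1 - alpha) to the mean value vector.
    alpha = s_hat[i2].sum()
    v_mean = V.mean(axis=0)  # a deployment would keep this as a running mean
    return alpha * y_top + (1.0 - alpha) * v_mean

rng = np.random.default_rng(0)
S, d = 256, 64
q = rng.standard_normal(d)
K = rng.standard_normal((S, d))
V = rng.standard_normal((S, d))
approx = sparq_attention(q, K, V, r=16, k=32)
exact = dense_attention(q, K, V)
```

With `r = d` and `k = S` the sketch reduces to dense attention, which makes a convenient sanity check. The data-transfer saving comes from choosing `r` and `k` well below `d` and `S`.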

Abstract

The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to $8\times$ savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
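The claim that token generation is bottlenecked by data transfer can be checked with a back-of-the-envelope arithmetic-intensity estimate. The sketch below is a rough model, not the paper's analysis. It assumes 16-bit weights and KV cache, one multiply-add per value read, approximate Llama 2 7B dimensions, and approximate A100 (40GB) peak figures; all of these numbers and names are assumptions introduced for illustration.

```python
def decode_step_intensity(n_params, n_layers, d_model, batch, seq_len,
                          bytes_per_elem=2):
    """Rough FLOPs per byte for generating one token per sequence in a batch."""
    # Weights: read once per step, one multiply-add per sequence in the batch.
    weight_flops = 2 * n_params * batch
    weight_bytes = bytes_per_elem * n_params
    # KV cache: K and V each hold seq_len * d_model values per layer per
    # sequence, and each cached value takes part in one multiply-add.
    kv_elems = 2 * batch * seq_len * n_layers * d_model
    kv_flops = 2 * kv_elems
    kv_bytes = bytes_per_elem * kv_elems
    return (weight_flops + kv_flops) / (weight_bytes + kv_bytes)

# Assumed hardware peaks for an A100 (40GB): ~312 TFLOP/s FP16, ~1555 GB/s.
PEAK_FLOPS = 312e12
PEAK_BANDWIDTH = 1555e9
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to become compute bound

# Assumed (approximate) Llama 2 7B shape.
LLAMA2_7B = dict(n_params=6.7e9, n_layers=32, d_model=4096)

for batch, seq_len in [(1, 4096), (64, 4096), (256, 512)]:
    ai = decode_step_intensity(batch=batch, seq_len=seq_len, **LLAMA2_7B)
    bound = "memory" if ai < RIDGE else "compute"
    print(f"B={batch} S={seq_len} intensity={ai:.1f} FLOP/B ({bound} bound)")
```

Under this model, larger batches raise the intensity of the weight reads, but the KV cache term grows with both batch size and sequence length and keeps the overall intensity low. That is why reducing KV cache transfers is the lever SparQ Attention targets.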

Paper Structure (49 sections, 17 equations, 19 figures, 5 tables, 1 algorithm)

Introduction
Background
Arithmetic intensity
Time in attention
Approximating Attention
Attention scores sparsity
Mean value reallocation
Query sparsity
Mean value reallocation with query sparsity
SparQ Attention
Grouped query attention
Experiments
Setup
Models
Tasks
...and 34 more sections

Figures (19)

Figure 1: Llama 2 13B SQuAD 1-shot performance versus attention transfers over a range of compression ratios. SparQ Attention achieves matching performance while transferring between $1/8$ and $1/4$ as much data as the original dense model. Line thickness shows $\pm$ one standard error over 4000 examples (the uncertainty from a finite test set). This pattern is representative of the performance across various models and tasks, shown in the appendix trade-off grids (Llama, miscellaneous, and Pythia models).
Figure 2: Roofline analysis of Llama 2 7B on A100 (40GB), highlighting that for a range of LLM inference settings with batch size $B$ and sequence length $S$, practical performance is memory bandwidth bound.
Figure 3: The proportion of time that is spent in attention layers during Llama 2 7B inference with a single sample when using llama.cpp on both CPU and GPU platforms. For more details, see the llama.cpp appendix.
Figure 4: Statistics of Llama 2 7B, evaluated over 40 SQuAD queries, over all 32 layers $\times$ 32 heads unless noted. (a) Sum softmax output allocated to the 32 highest-scoring positions, demonstrating natural attention sparsity; (b) for each head. (c) Kernel density estimate (Rosenblatt, 1956) of components of $\boldsymbol{q}$ in layer 16, showing heavy tails. (d) Fisher kurtosis of $\boldsymbol{q}$ components, for each head, showing that the query vector is leptokurtic for most heads. (e) Top-$k$ agreement between approximate and true scores for multiple values of $r$ selected from the query vector. Top-$k$ agreement is the proportion of the top-$k$ positions that are correctly predicted by an approximated softmax, using a projection of $\boldsymbol{q}$. (f) Agreement between the coverage $\alpha$ based on estimated scores versus the true mass of the top 128 scores, for different softmax temperatures (a point for each example $\times$ head), showing the importance of correct temperature. Further analysis is presented in the analysis appendix.
Figure 5: SparQ Attention
...and 14 more figures
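
The top-$k$ agreement metric described in the Figure 4 caption can be written down directly. The sketch below is a minimal pure-Python version with illustrative scores; the function name and the example values are assumptions and do not come from the paper.

```python
def top_k_agreement(true_scores, approx_scores, k):
    """Fraction of the true top-k positions also found in the approximate top-k."""
    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top_k(true_scores) & top_k(approx_scores)) / k

# Illustrative attention scores over five cache positions.
true_scores = [0.50, 0.20, 0.15, 0.10, 0.05]
approx_scores = [0.40, 0.10, 0.30, 0.15, 0.05]
agreement = top_k_agreement(true_scores, approx_scores, k=2)
```

Here the true top-2 positions are 0 and 1 while the approximation picks 0 and 2, so half of the true top-2 is recovered.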
