SparQ Attention: Bandwidth-Efficient LLM Inference
Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
TL;DR
SparQ Attention targets the memory-bandwidth bottleneck in autoregressive LLM inference by fetching only the most relevant KV cache entries at each generation step. It combines a query-sparsity heuristic with top-$k$ attention selection and mean-value reallocation, reducing attention data transfer by up to $8\times$ while preserving task performance across diverse models. The method supports grouped-query attention, scales to long sequences, and is validated through microbenchmarks and end-to-end tests on CPU and GPU, which show practical inference speedups. Overall, SparQ is a hardware-aware, plug-in improvement for pre-trained LLMs that retains the full context and generalizes across model families and tasks.
Abstract
The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. Many applications must support long input sequences and process them in large batches, which typically causes token generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We evaluate Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks and show that SparQ Attention brings up to $8\times$ savings in attention data transfers without substantial drops in accuracy.
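To make "selective fetching of the cached history" concrete, the following is a minimal single-head, single-query NumPy sketch of the three ingredients named in the TL;DR: query sparsity, top-$k$ selection, and mean-value reallocation. It is an illustration under stated assumptions, not the reference implementation: the function names, the parameters `r` (number of query components kept) and `k` (number of positions fetched), the softmax temperature correction in step 1, and the element-count model in `transfer_ratio` are all assumptions introduced here. Grouped-query attention, batching, and any treatment of recent tokens are omitted, and the query is assumed to be non-zero.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dense_attention(q, K, V):
    # Baseline: reads every element of K and V (2 * S * d values).
    d = q.shape[0]
    return softmax(K @ q / np.sqrt(d)) @ V

def sparq_attention_sketch(q, K, V, r, k):
    # q: (d,) query; K, V: (S, d) cached keys and values.
    S, d = K.shape

    # Step 1 (query sparsity): keep the r largest-magnitude query components
    # and approximate the attention scores from those r columns of K only.
    i1 = np.argsort(-np.abs(q))[:r]
    # Assumed temperature correction: shrink the usual sqrt(d) scale by the
    # fraction of the query's L1 mass that the kept components retain.
    scale = np.sqrt(d * np.abs(q[i1]).sum() / np.abs(q).sum())
    s_hat = softmax(K[:, i1] @ q[i1] / scale)

    # Step 2 (top-k selection): fetch full key and value rows only for the
    # k positions with the largest approximate scores.
    i2 = np.argsort(-s_hat)[:k]
    s = softmax(K[i2] @ q / np.sqrt(d))

    # Step 3 (mean-value reallocation): alpha estimates the score mass covered
    # by the selected positions; the remainder is assigned to the mean value
    # vector (recomputed here, but it could be maintained as a running mean).
    alpha = s_hat[i2].sum()
    v_mean = V.mean(axis=0)
    return alpha * (s @ V[i2]) + (1.0 - alpha) * v_mean

def transfer_ratio(S, d, r, k):
    # Rough element count for this sketch only: dense attention reads 2*S*d
    # values; the sketch reads S*r (step 1) plus 2*k*d (step 2). The mean
    # value vector and index traffic are ignored.
    return (2 * S * d) / (S * r + 2 * k * d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, d = 256, 64
    q = rng.standard_normal(d)
    K = rng.standard_normal((S, d))
    V = rng.standard_normal((S, d))
    approx = sparq_attention_sketch(q, K, V, r=16, k=32)
    exact = dense_attention(q, K, V)
    print("max abs error:", np.max(np.abs(approx - exact)))
    print("transfer ratio:", transfer_ratio(S, d, r=16, k=32))
```

With `r` equal to the head dimension and `k` equal to the sequence length, the sketch reduces to dense attention, which gives a simple sanity check. Smaller `r` and `k` trade approximation error for fewer elements read, and `transfer_ratio` gives a back-of-the-envelope view of that trade-off for this sketch.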
