The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti

TL;DR

This work provides the most comprehensive training-free evaluation of sparse attention for long-context Transformer LLMs to date, covering models from 7B to 72B parameters and sequence lengths up to 128K tokens. It shows that, at an equal FLOPS budget (isoFLOPS), large sparse models can outperform smaller dense ones on very long sequences, especially during decoding, where higher sparsity is tolerable. No single sparse method wins across all tasks and phases; performance is highly task- and phase-specific, which calls for adaptive sparsity strategies and careful benchmarking. The paper also introduces scaling laws tailored to sparse attention and releases code to support broader validation and deployment decisions in long-context settings.

Abstract

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks, including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) An isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.
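To make the isoFLOPS framing in finding 1 concrete, the following is a rough back-of-the-envelope sketch of per-token decoding cost, not the paper's own estimator. It assumes the common approximations of 2 FLOPs per parameter for the dense part of the forward pass and 4 · n_layers · d_model · seq_len FLOPs for attention over the cached context, and it ignores the indexing overhead of sparse methods, which the paper's estimate does include. The function name, its parameters, and the two model configurations are hypothetical and chosen only for illustration.

```python
def decode_flops_per_token(n_params, n_layers, d_model, seq_len, compression=1.0):
    """Rough FLOPs to decode one token (assumed approximation, not from the paper).

    Dense part: about 2 FLOPs per parameter.
    Attention part: QK^T and AV each cost about 2 * d_model * seq_len per layer;
    a sparse method with the given compression ratio attends to seq_len / compression keys.
    """
    dense = 2.0 * n_params
    attention = 4.0 * n_layers * d_model * (seq_len / compression)
    return dense + attention


def attention_share(n_params, n_layers, d_model, seq_len, compression=1.0):
    """Fraction of the per-token decoding FLOPs spent in attention."""
    total = decode_flops_per_token(n_params, n_layers, d_model, seq_len, compression)
    return (total - 2.0 * n_params) / total


# Hypothetical configurations, for illustration only.
small = dict(n_params=7e9, n_layers=28, d_model=3584)
large = dict(n_params=72e9, n_layers=80, d_model=8192)

for seq_len in (32_768, 131_072):
    dense_small = decode_flops_per_token(seq_len=seq_len, **small)
    sparse_large = decode_flops_per_token(seq_len=seq_len, compression=20.0, **large)
    print(f"{seq_len:>7} tokens: small dense {dense_small:.3e}, large 20x sparse {sparse_large:.3e}")
```

Under these assumptions the attention term grows linearly with sequence length while the dense term stays fixed, so the share of compute that sparsity can remove increases with context length. An isoFLOPS comparison then asks which (model size, compression) pairs land on the same total budget and which of them scores higher.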

Paper Structure

This paper contains 59 sections, 6 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Overview of sparse attention methods for prefilling (left) and generation (right). These methods differ in the units of sparsification (blocks or pages vs. verticals and slashes), importance estimation, and KV cache management strategies. Colours represent query-key interactions preserved at different sparsity levels, while white areas indicate interactions that are not computed.
  • Figure 2: Performance comparison for batch size 1 across FLOPS, which are a function of sequence length, model size and sparsity level. We report 4 model sizes (markers) and compression ratios up to 20$\times$ (heatmap). Performance scores are aggregated across all 9 tasks. In the plots, we display two sequence lengths, 32k (left) and 128k (right), and two phases, prefilling (top) and decoding (bottom). Crucially, there is a phase transition where after a critical sequence length (32k to 64k tokens for Qwen family models), highly sparse and large models surpass dense and small models in performance for the same FLOPS budget. See the section referenced as sec:flops_breakdown for details on how we estimate the FLOPS, including indexing costs for sparse attention methods.
  • Figure 3: Maximum compression ratio with statistically significant performance retention (y-axis) across different model sizes (colours) and sequence lengths (x-axis). Each point represents a task, with horizontal bars showing the average maximum compression across tasks and vertical bars indicating standard deviation. Left: Vertical-Slash pattern for prefilling. Right: Quest pattern for decoding. The key conclusion is that decoding tolerates higher compression than prefilling on average, with larger models maintaining performance even at very high compression ratios. However, almost every configuration has at least one task where maximum tolerable compression is below 5$\times$ (72B Quest being the only exception).
  • Figure 4: Performance comparison of different sparse attention methods across 9 tasks, aggregated over sequence lengths and models (shaded areas indicate the standard error). Top: prefilling methods (Vertical-Slash, FlexPrefill, Block-Sparse). Bottom: decoding methods (SnapKV, Ada-SnapKV, Quest). Each subplot shows the relationship between performance and compression for a specific task. The trade-off appears extremely task-dependent. Overall, Vertical-Slash performs best among prefilling methods, while Quest performs best among decoding methods.
  • Figure 5: Block-Sparse block size.
  • ...and 7 more figures
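
The units of sparsification named in the Figure 1 caption (blocks vs. verticals and slashes) can be illustrated with a small NumPy sketch. It is a toy construction under stated assumptions, not the implementation of any method in the paper: a "vertical" is taken to be a key column that every later query attends to, a "slash" is a diagonal at a fixed query-key offset, and the selected columns, offsets, and blocks are passed in by hand rather than estimated from attention scores. All function names are hypothetical.

```python
import numpy as np


def causal_mask(n):
    """Dense causal mask: query i may attend to keys 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))


def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Keep chosen key columns (verticals) and chosen diagonals (slashes), within the causal region."""
    q = np.arange(n)[:, None]  # query positions as a column
    k = np.arange(n)[None, :]  # key positions as a row
    keep = np.isin(k, vertical_cols) | np.isin(q - k, slash_offsets)
    return keep & causal_mask(n)


def block_sparse_mask(n, block, kept_blocks):
    """Keep whole (query_block, key_block) tiles, clipped to the causal region."""
    keep = np.zeros((n, n), dtype=bool)
    for qb, kb in kept_blocks:
        keep[qb * block:(qb + 1) * block, kb * block:(kb + 1) * block] = True
    return keep & causal_mask(n)


def compression_ratio(mask):
    """Dense causal interactions divided by the interactions actually kept."""
    n = mask.shape[0]
    return causal_mask(n).sum() / mask.sum()


vs = vertical_slash_mask(8, vertical_cols=[0], slash_offsets=[0])
bs = block_sparse_mask(8, block=4, kept_blocks=[(0, 0), (1, 1)])
print(vs.astype(int))
print("vertical-slash compression:", compression_ratio(vs))
print("block-sparse compression:", compression_ratio(bs))
```

The compression ratio computed here counts only preserved query-key pairs; it says nothing about how well a given pattern retains task accuracy, which is what the figures above measure.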