Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

TL;DR

This work addresses the limitation of fixed-budget sparsity in attention for long-context LLMs by introducing Twilight, a framework that adds adaptive top-$p$ pruning on top of existing sparse attention methods. By selecting the minimal set of tokens whose cumulative attention weight reaches a threshold $p$, Twilight adapts the budget to each head's attention distribution and to the prompt, enabling substantial speedups with negligible accuracy loss. The approach is implemented as a hierarchical select-then-prune pipeline with efficient GPU kernels and INT4 KV-cache quantization, pruning up to 98% of tokens, with reported accelerations of $15.4\times$ in self-attention operations and $3.9\times$ in end-to-end per-token decoding latency. Empirical results across multiple benchmarks and models show consistent accuracy retention, supporting adaptive sparsity as a practical path for long-context inference.

Abstract

Leveraging attention sparsity to accelerate long-context large language models (LLMs) is an active research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that applying the idea behind top-$p$ sampling (nucleus sampling) to sparse attention can, surprisingly, achieve adaptive budgeting. Based on this, we propose Twilight, a framework that brings adaptive sparsity to any existing sparse attention algorithm without sacrificing its accuracy. Empirical results show that Twilight can adaptively prune up to 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per-token latency in long-context LLM decoding.
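To make the idea of adaptive budgeting concrete, the following is a minimal sketch of top-$p$ selection over attention weights, written in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `softmax` and `top_p_indices` and the toy score vectors are hypothetical, and the threshold 0.9 is simply reused from the Figure 1 caption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_indices(weights, p=0.9):
    """Return the smallest set of indices whose weights sum to at least p.

    Hypothetical helper: sorts tokens by weight (largest first) and stops
    as soon as the accumulated weight reaches the threshold p.
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += weights[i]
        if cumulative >= p:
            break
    return chosen

# Toy attention scores (assumed values, not taken from the paper).
peaked = softmax([8.0, 7.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # focused attention
flat = softmax([0.0] * 8)                                    # diffuse attention

print(len(top_p_indices(peaked)))  # 2 tokens already cover 0.9 of the weight
print(len(top_p_indices(flat)))    # all 8 tokens are needed to reach 0.9
```

With a fixed top-$k$ budget of, say, $k = 4$, the peaked head would receive twice as many tokens as it needs, while the flat head would capture only half of its attention weight. The top-$p$ rule assigns 2 and 8 tokens respectively, which is the over-selection and under-selection contrast the abstract describes.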

Paper Structure

This paper contains 22 sections, 4 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison of top-$k$ and top-$p$ for attention sparsity. Approximate attention typically employs techniques such as pooling, channel pruning, and quantization to approximate the query ($\tilde{Q}$) and key ($\tilde{K}$) and estimate the attention weights. These weights are then used to select important tokens for sparse attention. (a) Top-$k$ sparsity, utilized by most existing designs, relies on a fixed $k$-token budget and often results in over-selection ($\sum \tilde{p}_i > 0.9$) or under-selection ($\sum \tilde{p}_i < 0.9$). (b) Our proposed top-$p$ sparsity dynamically adjusts the budget to accumulate just sufficient attention weights ($\sum \tilde{p}_i = 0.9$), enabling more efficient and adaptive sparse attention.
  • Figure 2: Relationship between the KV cache budget and perplexity on the PG-19 dataset for different top-$k$ sparse attention methods.
  • Figure 3: Diverse distributions observed in attention weights. The leftmost image illustrates a "flat" distribution (diffuse attention), where the weights are close to uniformly distributed. The middle image depicts a "peaked" distribution (focused attention), where the weights are concentrated on the tokens at the two sides. When overlaid as in the rightmost image, the differences between these distributions become readily apparent.
  • Figure 4: Cumulative attention scores of different budget selections in one example attention head.
  • Figure 5: Twilight architecture. Twilight wraps an existing base sparse attention algorithm and further optimizes it. It computes self-attention in three steps. First, the Token Selector selects critical tokens using the base algorithm under a conservative budget. Then, the Twilight Pruner prunes the selected token subset via top-$p$ thresholding. Finally, the pruned token indices are passed to the Sparse Attention Kernel to perform the attention computation.
  • ...and 9 more figures
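The three steps in the Figure 5 caption can be sketched as a small, single-head Python function. This is a simplified illustration, not the paper's GPU implementation. Several things here are assumptions: the name `select_then_prune`, the `approx_scores` input (standing in for whatever the base algorithm estimates), the use of plain top-$k$ as the base selector, the use of exact rather than quantized scores in the pruner, and the renormalization of weights over the kept tokens in the final step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_then_prune(q, keys, values, approx_scores, base_budget, p=0.9):
    """Hypothetical single-query, single-head select-then-prune sketch."""
    scale = 1.0 / math.sqrt(len(q))

    # Step 1 (Token Selector): a stand-in base algorithm that keeps the
    # base_budget tokens with the highest approximate scores.
    candidates = sorted(range(len(keys)),
                        key=lambda i: approx_scores[i],
                        reverse=True)[:base_budget]

    # Step 2 (Pruner): top-p thresholding over the candidates' weights.
    # Assumption: exact scores are used here for simplicity.
    cand_weights = softmax([dot(q, keys[i]) * scale for i in candidates])
    order = sorted(range(len(candidates)),
                   key=lambda j: cand_weights[j], reverse=True)
    kept, cumulative = [], 0.0
    for j in order:
        kept.append(candidates[j])
        cumulative += cand_weights[j]
        if cumulative >= p:
            break

    # Step 3 (Sparse Attention Kernel): attention over the kept tokens only.
    weights = softmax([dot(q, keys[i]) * scale for i in kept])
    out = [sum(w * values[i][c] for w, i in zip(weights, kept))
           for c in range(len(values[0]))]
    return kept, out

# Toy inputs (assumed values): six tokens, head dimension 2.
q = [1.0, 0.0]
values = [[1.0, 2.0]] + [[0.0, 0.0]] * 5
approx = [5.0, 0.1, 0.2, 0.0, 0.3, 0.0]
focused_keys = [[6.0, 0.0]] + [[0.0, 0.0]] * 5   # one dominant token
diffuse_keys = [[0.0, 0.0]] * 6                  # uniform attention

kept_focused, out_focused = select_then_prune(q, focused_keys, values, approx, base_budget=4)
kept_diffuse, out_diffuse = select_then_prune(q, diffuse_keys, values, approx, base_budget=4)
print(kept_focused)  # [0]
print(kept_diffuse)  # [0, 4, 2, 1]
```

Under the same conservative base budget of 4 tokens, the pruner reduces the focused head to a single token and leaves the diffuse head untouched, so the final budget follows the attention distribution rather than a fixed $k$.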

Theorems & Definitions (3)

  • Definition 3.1: Sparse Attention
  • Definition 3.2: Oracle Top-$k$ Sparse Attention
  • Definition 3.3: Oracle Top-$p$ Sparse Attention
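Only the titles of these definitions appear here. One plausible formalization, consistent with the Figure 1 caption but not quoted from the paper, is given below. The notation is assumed: a single query $q \in \mathbb{R}^{1 \times d}$, keys and values $K, V \in \mathbb{R}^{n \times d}$, and an index set $\mathcal{I}$ of selected tokens. Whether the weights are renormalized over $\mathcal{I}$ is left open in this sketch.

```latex
% Assumed notation, not taken verbatim from the paper.
\begin{align*}
W &= \operatorname{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{1 \times n} \\
\hat{o}_{\mathcal{I}} &= \sum_{i \in \mathcal{I}} W_i \, V_i,
  \qquad \mathcal{I} \subseteq \{1, \dots, n\}
  && \text{(sparse attention)} \\
\mathcal{I}_{\text{top-}k} &= \operatorname*{arg\,max}_{|\mathcal{I}| = k} \; \sum_{i \in \mathcal{I}} W_i
  && \text{(oracle top-$k$)} \\
\mathcal{I}_{\text{top-}p} &= \operatorname*{arg\,min}_{\mathcal{I} \,:\, \sum_{i \in \mathcal{I}} W_i \,\ge\, p} \; |\mathcal{I}|
  && \text{(oracle top-$p$)}
\end{align*}
```

Read this way, top-$k$ fixes the set size and maximizes the captured weight, while top-$p$ fixes the captured weight and minimizes the set size. "Oracle" indicates that the exact weights $W$ are used for selection rather than an approximation.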