Efficient Sparse Attention needs Adaptive Token Release

Chaoran Zhang; Lixin Zou; Dan Luo; Min Tang; Xiangyang Luo; Zihao Li; Chenliang Li

TL;DR

This work tackles the computational and storage cost of the growing key-value (KV) cache in decoder-based LLMs by introducing ADORE, an adaptive token release framework that maintains a fixed-size KV cache while approximating dynamic top-$K$ sparse attention. A lightweight GRU-based controller decides which tokens to retain and which low-contribution tokens to release, and a KV states rebuilding mechanism recovers information from released tokens that may be needed later; matrix slicing accelerates the implementation. Across natural language generation, stream generation, and language modeling tasks on a 7B Llama backbone, ADORE achieves substantial throughput gains (up to $221.8\%$ over full attention) while preserving text quality, and it surpasses many existing sparse attention baselines. The approach is compatible with existing LLM inference stacks and shows practical gains for long-context decoding and real-time applications. Its main limitations are the fine-tuning requirement and the initial $O(n^2)$ cost during training.

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide array of text-centric tasks. However, their 'large' scale introduces significant computational and storage challenges, particularly in managing the key-value states of the transformer, which limits their wider applicability. Therefore, we propose to adaptively release resources from caches and rebuild the necessary key-value states. Specifically, we accomplish this with a lightweight controller module that approximates an ideal top-$K$ sparse attention. This module retains the tokens with the top-$K$ attention weights and simultaneously rebuilds discarded tokens that may become essential for future decoding. Comprehensive experiments in natural language generation and modeling reveal that our method is not only competitive with full attention in terms of performance but also achieves a significant throughput improvement of up to 221.8%. The code for replication is available at https://github.com/WHUIR/ADORE.
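
As a rough illustration of the "ideal top-$K$ sparse attention" that the controller is said to approximate, the sketch below computes attention for a single query over only the $K$ highest-scoring keys. It is a minimal pure-Python example under stated assumptions, not the authors' implementation: the function names `softmax` and `topk_sparse_attention`, the scaled dot-product scoring, and the toy vectors are all illustrative choices that do not come from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the largest scaled dot-product scores.

    Hypothetical helper for illustration; the scoring rule is an assumption.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k highest-scoring tokens, restored to positional order.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    # Softmax is renormalized over the kept tokens only.
    weights = softmax([scores[i] for i in kept])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, kept))
              for d in range(dim)]
    return kept, weights, output

# Toy data (made up): four cached tokens, two-dimensional states.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [3.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 5.0]]
kept, weights, output = topk_sparse_attention(query, keys, values, k=2)
print(kept)  # positions of the two tokens that receive attention
```

Computing exact top-$K$ this way still requires scoring every key, which is why a fixed-size cache needs some cheaper way to predict which tokens will matter.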

Paper Structure (30 sections, 4 equations, 9 figures, 8 tables)

Introduction
Methodology
Efficient Sparse Transformer
Adaptive Token Release
KV States Rebuild
Matrix Slicing as Multiplication
Experiment
Experimental Settings
Natural Language Generation
Stream Generation
Natural Language Modeling
Ablation Study
Influence of Attention Sparsity
Influence of KV States Rebuild
Effectiveness of Controller Module
...and 15 more sections

Figures (9)

Figure 1: An illustration of the conflict caused by releasing key-value (KV) states in advance during inference. Consider a cache size of 3. At step N, the KV states associated with the word 'profoundly' are released from the cache. Consequently, at the subsequent step N+1, the 'profoundly' state is absent from the cache, even though 'profoundly' has a higher attention score for 'in'.
Figure 2: The controller module calculates the importance of all input and generated tokens for the current token. The Key-Value (KV) cache maintains the states of $m$ tokens with the highest importance. For tokens that were previously released from the cache, those with the top-$R$ highest importance are concurrently modeled alongside the current token.
Figure 3: Performance comparison in terms of throughput for generating different text lengths.
Figure 4: Performance comparison on the StreamEval at various query times.
Figure 5: Perplexity evaluation on CNN DM and SAMsum across different lengths.
...and 4 more figures
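
The selection described in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: the function `select_cached_and_rebuilt`, its arguments, and the importance scores are hypothetical stand-ins for the controller's output, and the real method may differ in how it scores tokens and updates the cache.

```python
def select_cached_and_rebuilt(importance, cached, m, r):
    """Split token positions into (kept in cache, rebuilt this step).

    importance: one score per token seen so far (stand-in for controller output).
    cached: set of positions whose KV states are currently in the cache.
    m: cache size; r: number of released tokens to rebuild.
    All names here are assumptions made for illustration.
    """
    # Keep the m most important tokens among those still cached.
    by_score = sorted(cached, key=lambda i: importance[i], reverse=True)
    keep = sorted(by_score[:m])
    # Released tokens: seen so far, but their KV states are no longer cached.
    released = [i for i in range(len(importance)) if i not in cached]
    # Rebuild the r most important released tokens alongside the current token.
    released.sort(key=lambda i: importance[i], reverse=True)
    rebuild = sorted(released[:r])
    return keep, rebuild

# Toy scores (made up): tokens 0 and 2 were released at earlier steps.
importance = [0.9, 0.1, 0.6, 0.3, 0.8, 0.2]
cached = {1, 3, 4, 5}
keep, rebuild = select_cached_and_rebuilt(importance, cached, m=3, r=1)
print(keep, rebuild)
```

In this toy case, token 0 was released earlier but has the highest importance for the current step, so it is the one selected for rebuilding, which mirrors the conflict that Figure 1 illustrates.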
