Efficient Sparse Attention needs Adaptive Token Release
Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li
TL;DR
This work tackles the computational and storage cost of the growing key-value (KV) cache in decoder-based LLMs by introducing ADORE, an adaptive token release framework that keeps the KV cache at a fixed size while approximating dynamic top-$K$ sparse attention. A lightweight GRU-based controller decides which tokens to retain and which low-contribution tokens to release, and a KV state rebuilding mechanism recovers released tokens that become necessary later; matrix slicing accelerates the implementation. Across natural language generation, streaming, and language modeling tasks on a 7B Llama backbone, ADORE improves throughput by up to $221.8\%$ over full attention while preserving text quality, and it surpasses many existing sparse attention baselines. The approach is compatible with existing LLM inference stacks and yields practical gains for long-context decoding and real-time applications. Its main limitations are the fine-tuning requirement and the initial $O(n^2)$ cost during training.
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide array of text-centric tasks. However, their "large" scale introduces significant computational and storage challenges, particularly in managing the key-value states of the transformer, which limits their wider applicability. We therefore propose to adaptively release resources from the cache and rebuild the necessary key-value states. Specifically, we use a lightweight controller module to approximate an ideal top-$K$ sparse attention. This module retains the tokens with the $K$ highest attention weights and simultaneously rebuilds tokens that were discarded but may become essential for future decoding. Comprehensive experiments in natural language generation and language modeling show that our method is competitive with full attention in terms of performance while improving throughput by up to 221.8%. The code for replication is available at https://github.com/WHUIR/ADORE.
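
To make the top-$K$ target concrete, the following is a minimal sketch of idealized top-$K$ sparse attention for a single decoding step, written in plain Python. It is an illustration only and is not taken from the ADORE code: the function name `topk_sparse_attention` and its arguments are hypothetical, and it computes the exact top-$K$ selection that the controller is described as approximating, not the GRU-based controller or the rebuilding mechanism.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend over only the k cached tokens with the largest attention scores.

    Hypothetical helper for illustration; not part of the ADORE codebase.
    query: list of floats (dimension d)
    keys, values: lists of per-token vectors held in the KV cache
    Returns the attention output and the sorted indices of the retained tokens.
    """
    d = len(query)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    # Indices of the k highest-scoring tokens; the remaining tokens are the
    # ones an ideal release policy would drop at this step.
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    kept.sort()
    # Numerically stable softmax restricted to the retained tokens.
    max_score = max(scores[i] for i in kept)
    exps = [math.exp(scores[i] - max_score) for i in kept]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the retained value vectors.
    dim_v = len(values[0])
    output = [sum(w * values[i][j] for w, i in zip(weights, kept)) for j in range(dim_v)]
    return output, kept
```

With `k` equal to the number of cached tokens, this reduces to ordinary full attention; with a smaller `k`, the cost of the softmax and weighted sum depends on `k` and not on the sequence length. Note that computing the exact top-$K$ still requires scoring every cached key, which is why an approximation by a learned controller is attractive.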
