Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
TL;DR
Judge Q addresses the KV cache eviction bottleneck in long-context LLMs by introducing trainable soft tokens that are appended to the input during pre-fill. Only the embedding layer is trained, so that the soft tokens' attention over the input aligns with that of the actual decoded tokens; the resulting queries capture global information and improve eviction decisions without extensive fine-tuning. Experiments on Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3 across LongBench, RULER, and Needle-in-a-Haystack show consistent improvements over existing eviction baselines under the same budget, with gains of about 1 point on LongBench and over 3 points on RULER, sometimes approaching the Full KV upper bound. The approach is easy to adopt in open-source models and offers practical benefits for memory-limited deployments; the analyses highlight data quality, soft-token count, and access to global information as key factors. Overall, Judge Q provides a low-cost, effective way to preserve decoding quality under KV cache eviction by scoring cache entries with globally informed soft-token queries at pre-fill time.
Abstract
Large language models (LLMs) use a key-value (KV) cache to store historical information during sequence processing. The size of the KV cache grows linearly with sequence length, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically use the last window of the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to focus too heavily on local information and can therefore overlook crucial global information. To mitigate this issue, we propose Judge Q, a novel training method that incorporates a soft token list. The method tunes only the model's embedding layer, at a low training cost. The soft token list is concatenated to the end of the input sequence, and these tokens' attention map over the original input sequence is trained to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when the KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation than existing eviction approaches. We validate our approach through experiments on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on LongBench and over 3 points on RULER. The proposed method can be integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
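The query-based eviction scheme described in the abstract can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the function name `select_kv_to_keep`, the choice to sum softmax attention weights over queries as the importance score, and the toy vectors are all assumptions made for illustration. In the last-window baseline the queries would come from the final prompt tokens; in Judge Q they would come from the trained soft tokens appended to the input, which the hand-written vectors below merely stand in for.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def select_kv_to_keep(queries, keys, budget):
    """Return sorted indices of the `budget` cache entries to retain.

    Assumption: importance of a key is the sum, over all scoring queries,
    of the softmax attention weight that query assigns to the key.
    """
    d = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        # Scaled dot-product attention logits of this query against every key.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for j, weight in enumerate(softmax(logits)):
            scores[j] += weight
    ranked = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Toy cache with four keys (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

# Stand-in for last-window queries: both point the same way, so the score
# concentrates on a single key.
last_window_queries = [[4.0, 0.0], [3.0, 0.0]]

# Stand-in for soft-token queries: they attend to different parts of the cache.
soft_token_queries = [[4.0, 0.0], [0.0, -4.0]]

kept_by_window = select_kv_to_keep(last_window_queries, keys, budget=1)
kept_by_soft = select_kv_to_keep(soft_token_queries, keys, budget=2)
```

Only the source of the scoring queries differs between the two calls; the scoring and top-`budget` selection are identical. Practical eviction methods operate per layer and per head and may smooth or pool the scores, all of which this sketch omits.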
