Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference

Qingfa Xiao, Jiachuan Wang, Haoyang Li, Cheng Deng, Jiaqi Tang, Shuangyin Li, Yongqi Zhang, Jun Wang, Lei Chen

TL;DR

ActQKV tackles the inefficiency of KV retrieval in long-context LLM inference by introducing a training-free Activation-aware Probe-Query that emphasizes anchor tokens within each sliding window. It couples this with a Dynamic KV Cut-off that allocates retrieval budget across transformer layers based on information density, yielding more relevant KV recall under a modest KV budget. Empirical results on Long-Bench and ∞-Bench demonstrate state-of-the-art performance with substantial KV-reduction (2K budget) and improved factual retrieval, while maintaining competitive quality and resource efficiency. The approach advances long-context reasoning by combining activation-guided context representation with principled layer-wise budget allocation, enabling scalable and reliable KV-based retrieval in autoregressive LLMs.

Abstract

Recent advances in large language models (LLMs) have showcased exceptional performance in long-context tasks, but inference efficiency remains a significant challenge under limited GPU memory. Existing solutions first proposed the sliding-window approach to accumulate a set of historical key-value (KV) pairs for reuse; later improvements selectively retain subsets of these pairs at each step. However, because attention is sparsely distributed across a long context, it is hard to identify and recall the relevant KV pairs: attention is distracted by the massive number of candidate pairs. Additionally, we found it promising to select representative tokens as the probe-Query in each sliding window to effectively represent the entire context, an approach overlooked by existing methods. Thus, we propose ActQKV, a training-free, Activation-aware approach that dynamically determines the probe-Query and leverages it to retrieve the relevant KV pairs for inference. Specifically, ActQKV monitors a token-level indicator, Activation Bias, within each context window, enabling the proper construction of the probe-Query for retrieval at the pre-filling stage. To accurately recall the relevant KV pairs and minimize the irrelevant ones, we design a dynamic KV cut-off mechanism guided by information density across layers at the decoding stage. Experiments on the Long-Bench and ∞-Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
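
As a rough illustration of what building a probe-Query from a window of tokens could look like, the sketch below collapses a window of query vectors into one vector using per-token weights. The abstract does not define Activation Bias, so the score used here (each token's L2 distance from the window mean) is only a hypothetical stand-in. The function name `probe_query` and the `temperature` parameter are also assumptions made for this example, not part of ActQKV.

```python
import math


def probe_query(queries, temperature=1.0):
    """Collapse a window of query vectors into one weighted probe vector.

    ASSUMPTION: the per-token score is the L2 distance of the token's query
    vector from the window mean. This is a hypothetical stand-in for the
    paper's Activation Bias, which the abstract does not define.
    """
    n, d = len(queries), len(queries[0])
    mean = [sum(q[j] for q in queries) / n for j in range(d)]

    # Tokens far from the window mean get a higher score (assumed indicator).
    scores = [
        math.sqrt(sum((q[j] - mean[j]) ** 2 for j in range(d)))
        for q in queries
    ]

    # Numerically stable softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The probe is the weighted average of the window's query vectors.
    probe = [sum(w * q[j] for w, q in zip(weights, queries)) for j in range(d)]
    return probe, weights


window = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 5.0]]
probe, weights = probe_query(window)
```

With uniform weights this reduces to plain mean pooling over the window. Non-uniform weights let a few anchor-like tokens dominate the probe, which is the behaviour the abstract attributes to the activation-aware construction.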

Paper Structure

This paper contains 25 sections, 12 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Visualization of query vector states within the probe-Query, compared between ActQKV and InfLLM, for the question "Who is Sobe (Sister of Saint Anne)’s Grandchild?". For simplicity, we display the states of 15 tokens from a window of size 256 in the last transformer layer. The probe-Query generated by ActQKV aligns more closely with the SOTA embedding model BGE-M3 [chen-etal-2024-m3]. In contrast, InfLLM produces evenly distributed similarities across the context and does not prioritize anchor tokens.
  • Figure 2: Illustration of our ActQKV. Sliding window attention stores historical KV pairs in a cache and reuses them for subsequent window inference. Based on this, ActQKV first identifies the anchor tokens within the window and then constructs the activation-aware probe-Query. This probe-Query is subsequently used to retrieve the top-k relevant KV pairs from the cache during the pre-filling stage. During the decoding stage, the cut-off mechanism dynamically adjusts the number of recalled KV pairs based on the distribution of key-values at each layer, ensuring the inclusion of relevant pairs while minimizing the influence of irrelevant ones. The cache can be stored in the CPU and transferred to the GPU when needed. All our contributions are highlighted in red.
  • Figure 3: Analysis of the top-$k$ (avg. $k$=1,472) most relevant KV pairs for each inference step across layers. We randomly select 50 samples from Long-Bench and filter out those with a length less than 8K. In each layer, we calculate 35,180 similarity scores for ActQKV and for InfLLM. Each score is calculated from a probe-Query and a chunk containing 32 KV pairs. The average perplexity is the mean, over samples, of the perplexity of each sample's scores.
  • Figure 4: Long-Bench [longbench].
  • Figure 5: $\infty$-Bench [infinitebench].
  • ...and 1 more figure
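
The chunk-level scoring and top-k retrieval described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Representing a chunk by its mean key, scoring with a dot product, and defining perplexity as the exponential of the entropy of the softmax-normalized scores are all assumptions. The function names `chunk_scores`, `top_k_chunks`, and `score_perplexity` are hypothetical.

```python
import math


def chunk_scores(probe, keys, chunk_size=32):
    """Score each chunk of cached keys against the probe vector.

    ASSUMPTION: a chunk is represented by the mean of its key vectors and
    scored by a dot product with the probe; the source does not say how
    chunk representatives are chosen.
    """
    d = len(probe)
    scores = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        mean_key = [sum(k[j] for k in chunk) / len(chunk) for j in range(d)]
        scores.append(sum(p * m for p, m in zip(probe, mean_key)))
    return scores


def top_k_chunks(scores, k):
    """Return the indices of the k highest-scoring chunks, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def score_perplexity(scores):
    """Perplexity of the score distribution: exp(entropy(softmax(scores))).

    ASSUMPTION: this is one common definition; the caption does not spell
    out the normalization used in the paper.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = chunk_scores([1.0, 0.0], keys, chunk_size=2)
selected = top_k_chunks(scores, k=1)
```

Under this definition, a flat score distribution has perplexity equal to the number of chunks, and a sharply peaked one approaches 1. Lower perplexity therefore means the probe separates relevant chunks from irrelevant ones more decisively.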