Boosting Long-Context Management via Query-Guided Activation Refilling

Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian

TL;DR

Long-context processing in LLMs is constrained by context-window limits, the $O(n^2)$ cost of full attention, and the memory footprint of large KV activations. The paper introduces ACRE, which pairs a bi-layer KV cache (L1 global, L2 local) with a query-guided activation refilling mechanism that dynamically injects task-relevant local details into a compact global representation; the system is trained with a two-stage optimization. Empirical results on 12 long-context information-seeking datasets show improvements in both performance and efficiency over strong baselines across different LLMs. The approach targets accurate, scalable long-context reasoning for information-seeking applications while reducing memory and compute demands.

Abstract

Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context-window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query's information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In this paper, we propose a method for processing long-context information-seeking tasks via query-guided Activation Refilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed and localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thus improving answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE's effectiveness, achieving improvements in both performance and efficiency.
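
The refilling idea described above can be illustrated with a toy sketch. This is not the authors' implementation: in ACRE the L1 cache is learned through the two-stage optimization, whereas here, purely as an assumption for illustration, each L1 entry is simply the last KV pair of a fixed-size chunk. The names `build_bilayer_cache`, `refill`, `interval`, and `max_refill` are hypothetical; `interval` and `max_refill` loosely mirror the L1/L2 interval and the maximum refilling length mentioned in the figure captions. The context length is assumed to be a multiple of `interval`.

```python
from typing import List, Tuple

Vector = List[float]
KV = Tuple[Vector, Vector]  # one (key, value) pair


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used as a stand-in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


def build_bilayer_cache(keys: List[Vector], values: List[Vector], interval: int):
    """Return (l1, l2): l2 holds every KV pair, l1 holds one proxy per chunk.

    Assumption: the proxy is the last token of each chunk of `interval` tokens.
    """
    l2: List[KV] = list(zip(keys, values))
    proxy_positions = range(interval - 1, len(keys), interval)
    l1 = [(pos, keys[pos], values[pos]) for pos in proxy_positions]
    return l1, l2


def refill(query: Vector, l1, l2: List[KV], interval: int, max_refill: int) -> List[KV]:
    """Swap the highest-scoring L1 proxies for the L2 entries they stand for.

    At most `max_refill` L2 tokens are pulled in, i.e. max_refill // interval chunks.
    """
    budget_chunks = min(max_refill // interval, len(l1))
    # Rank proxies by query-key score, highest first.
    ranked = sorted(range(len(l1)), key=lambda i: dot(query, l1[i][1]), reverse=True)
    chosen = set(ranked[:budget_chunks])

    refilled: List[KV] = []
    for i, (pos, key, value) in enumerate(l1):
        if i in chosen:
            # Replace the proxy with the full chunk it summarizes.
            refilled.extend(l2[pos - interval + 1 : pos + 1])
        else:
            # Keep only the compact proxy for this chunk.
            refilled.append((key, value))
    return refilled
```

In this sketch the cache used for decoding keeps the original token order: chunks that are irrelevant to the query contribute a single proxy entry, while the selected chunks contribute all of their entries, so the decoder sees a compact global view plus query-specific local detail.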

Paper Structure

This paper contains 16 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of ACRE, standard RAG, and efficient long LLMs for information-seeking tasks. Standard RAG retrieves evidence without full-context perception, and long LLMs struggle with contexts exceeding their native window. ACRE overcomes these limitations with a resource-efficient bi-layer KV cache and query-guided refilling, capturing both global and local information while enhancing performance.
  • Figure 2: Overview of ACRE. (a) ACRE constructs the Bi-layer KV cache from a long context. (b) For an input query, ACRE refills the L1 KV cache with query-relevant entries from the L2 KV cache and decodes the final answer based on the refilled cache. (c) The two-stage optimization process used to train ACRE.
  • Figure 3: Ablation Study on Model Design Variations Across Different LLMs.
  • Figure 4: Analysis of the maximum refilling length $\eta$ (left) and the impact of the L1/L2 interval $l$ (right).