CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis

TL;DR

Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. System-wise, its coarse-granularity selection eliminates expensive data movement, turning theoretical sparsity into practical acceleration.

Abstract

Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by the KV cache as the context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular memory accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose CHESS, an algorithm-system co-designed KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. System-wise, coarse-granularity selection eliminates expensive data movement, turning theoretical sparsity into practical acceleration. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers stable, low-latency inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines. Code is available at https://anonymous.4open.science/r/CHESS-9958/.
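To make the idea of coarse-granularity selection concrete, here is a minimal sketch of page-level KV selection for a single decoding step. It is not the authors' implementation: the function name `select_pages`, the mean-key page summary, and the dot-product scoring rule are all assumptions chosen for illustration. The only elements taken from the paper's own description are that selection operates on whole pages and that the final context also keeps attention sinks and a local window.

```python
import numpy as np


def select_pages(query, keys, page_size, top_k,
                 num_sink_pages=1, num_local_pages=1):
    """Return sorted indices of KV pages to keep for one decoding step.

    Hypothetical sketch: pages are scored by the dot product between the
    query and the mean key of the page (an assumed relevance proxy).
    """
    num_tokens = keys.shape[0]
    num_pages = (num_tokens + page_size - 1) // page_size

    # One score per page, so selection cost scales with pages, not tokens.
    page_scores = np.empty(num_pages)
    for p in range(num_pages):
        page_keys = keys[p * page_size:(p + 1) * page_size]
        page_scores[p] = float(page_keys.mean(axis=0) @ query)

    # Pages that are always kept: the first pages (attention sinks)
    # and the most recent pages (local window).
    sink = set(range(min(num_sink_pages, num_pages)))
    local = set(range(max(num_pages - num_local_pages, 0), num_pages))
    forced = sink | local

    # Among the remaining pages, keep the top_k highest-scoring ones.
    candidates = [p for p in range(num_pages) if p not in forced]
    candidates.sort(key=lambda p: page_scores[p], reverse=True)
    chosen = set(candidates[:top_k])

    return sorted(forced | chosen)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(64, 8))   # 64 cached tokens, head dim 8
    query = rng.normal(size=8)
    print(select_pages(query, keys, page_size=4, top_k=2))
```

Because whole pages are kept or dropped, the selected KV entries stay contiguous in memory, which is the property the abstract credits for avoiding expensive data movement.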

Paper Structure

This paper contains 47 sections, 5 equations, 10 figures, 3 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Context-agnostic vs. context-aware KV selection. Red indicates critical tokens, while grey denotes ignored tokens. (a) Context-agnostic (e.g., SnapKV): Preserves the most important tokens based on attention scores. (b) Context-aware (Ours): Adaptively selects segments that are semantically relevant to the current generation and retains local context.
  • Figure 2: Escalating latency and dominant attention computation in long-context decoding.
  • Figure 3: Overview of the CHESS System Architecture. The system maintains a hierarchical view (Grid, Chunk, Page) over the physical KV cache to enable context-aware selection. The final context is reconstructed by combining semantically selected pages with attention sinks and the local query window.
  • Figure 4: Distribution of Average Entropy vs. Varentropy per KV Cache Page (Calibration Phase). This plot illustrates the density of pages from the calibration dataset, where warmer colors indicate higher concentration. Dashed lines represent the selected 99th percentile thresholds. The shaded upper-right region highlights the pruned area, corresponding to high-uncertainty outliers excluded by our method.
  • Figure 5: Normalized accuracy on long-context tasks. Axes represent relative performance scaled to the maximum score per domain. Abbreviations: L-ICL: Long In-context Learning; Struct: Structured Data; Dialog: Dialogue History; Multi/Single-QA: Multi/Single-Doc QA.
  • ...and 5 more figures
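The entropy and varentropy statistics named in the Figure 4 caption can be sketched as follows. This is an illustrative reading rather than the paper's procedure. It assumes each token comes with a probability distribution, that page statistics are plain averages over the tokens in a page, and that a page is pruned only when both statistics exceed their calibrated 99th-percentile thresholds (the "upper-right region"). All function names are hypothetical.

```python
import numpy as np


def entropy_and_varentropy(probs, eps=1e-12):
    """Entropy H = E[-log p] and varentropy Var[-log p] along the last axis."""
    probs = np.asarray(probs, dtype=np.float64)
    surprisal = -np.log(probs + eps)
    entropy = (probs * surprisal).sum(axis=-1)
    varentropy = (probs * (surprisal - entropy[..., None]) ** 2).sum(axis=-1)
    return entropy, varentropy


def page_statistics(token_probs, page_size):
    """Average per-token entropy and varentropy over each full page."""
    ent, var = entropy_and_varentropy(token_probs)
    num_pages = len(ent) // page_size  # a trailing partial page is dropped
    ent = ent[:num_pages * page_size].reshape(num_pages, page_size).mean(axis=1)
    var = var[:num_pages * page_size].reshape(num_pages, page_size).mean(axis=1)
    return ent, var


def calibrate_thresholds(page_ent, page_var, percentile=99.0):
    """Thresholds taken as a percentile of the calibration pages."""
    return np.percentile(page_ent, percentile), np.percentile(page_var, percentile)


def prune_mask(page_ent, page_var, ent_thr, var_thr):
    """True for pages in the assumed pruned region (both statistics high)."""
    return (np.asarray(page_ent) > ent_thr) & (np.asarray(page_var) > var_thr)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4096, 32))
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    ent, var = page_statistics(probs, page_size=16)
    ent_thr, var_thr = calibrate_thresholds(ent, var)
    print(prune_mask(ent, var, ent_thr, var_thr).sum(), "pages flagged")
```

Requiring both statistics to be high flags only joint outliers, so very few calibration pages fall in the pruned region. Whether the paper uses this exact conjunction is not stated in the caption.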