CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang

TL;DR

CompilerKV is a risk-adaptive, head-aware KV compression framework that compiles offline experience into reusable decision tables for prefill-only deployment and outperforms SOTA methods under a 512-token budget.

Abstract

Large Language Models (LLMs) in long-context scenarios are severely constrained by the linear growth of Key-Value (KV) cache memory. Existing KV compression methods rely either on static thresholds and attention-only heuristics or on coarse memory budget allocation. Under tight memory budgets, these methods overlook two key factors: prompt-dependent variation in compression risk and functional heterogeneity across attention heads. Ignoring them destabilizes token selection and leads to tail failures. To address these challenges, we propose CompilerKV, a risk-adaptive and head-aware compression framework that compiles offline experience into reusable decision tables for prefill-only deployment. CompilerKV integrates two synergistic components: (i) a Head Heterogeneity Table, learned via offline contextual bandits, which assigns head-specific reliability weights to explicitly govern functional differences across attention heads; and (ii) a Risk-Adaptive Threshold Gating mechanism that jointly models attention entropy and local perplexity, transforming prompt-level risk into deployable retention thresholds. Experiments on LongBench show that CompilerKV outperforms SOTA methods under a 512-token budget, recovering 97.7% of FullKV performance with a gain of up to 5.2 points over the strongest competitor.
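The Risk-Adaptive Threshold Gating idea in the abstract can be illustrated with a minimal sketch: compute attention entropy and perplexity for a prompt, bin both, and look up a retention threshold in a precomputed table. Everything concrete here is an assumption made for illustration, not the paper's implementation: the function names, the bin edges, the table values, and the choice to lower the threshold for high-entropy prompts are all hypothetical. Only the general direction for perplexity (higher risk maps to a lower threshold, so more tokens are kept) follows the paper's description.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def bin_index(value, edges):
    """Index of the first bin whose upper edge exceeds value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


# Hypothetical bin edges and table values, chosen only for illustration.
ENTROPY_EDGES = [1.0]          # two entropy bins: low, high
PPL_EDGES = [5.0, 20.0]        # three perplexity bins: low, medium, high
THRESHOLD_TABLE = [
    [0.30, 0.20, 0.10],        # low entropy
    [0.25, 0.15, 0.05],        # high entropy
]


def retention_threshold(attn_weights, token_log_probs):
    """Look up a retention threshold from the binned risk features."""
    e = bin_index(attention_entropy(attn_weights), ENTROPY_EDGES)
    p = bin_index(perplexity(token_log_probs), PPL_EDGES)
    return THRESHOLD_TABLE[e][p]


# A peaked attention pattern with confident predictions: low risk.
low_risk = retention_threshold([1.0, 0.0, 0.0, 0.0], [math.log(0.5)] * 3)
# A flat attention pattern with uncertain predictions: high risk.
high_risk = retention_threshold([0.25] * 4, [math.log(0.01)] * 3)
```

Because the table is fixed ahead of time, the deployment-side cost in this sketch is two feature computations and one lookup, which is the appeal of compiling the policy offline.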

Paper Structure

This paper contains 28 sections, 1 theorem, 22 equations, 6 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 4.1

Consider a transformer layer $l$ with input sequence length $T$ and budget $B_l$. Let $A^{(l,h)}$ and $\tilde{A}^{(l,h)}$ denote the full and compressed attention matrices, respectively. Under CompilerKV's selection policy determined by the stabilized utility $\hat{u}_t^{(l)}$ and the head reliabili... where $\epsilon_{\text{tail}}^{(l)}$ denotes the average attention mass of discarded tokens, and $W...
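The theorem statement describes $\epsilon_{\text{tail}}^{(l)}$ as the average attention mass of discarded tokens. A small sketch of one way to compute such a quantity for a single head follows. It assumes the average is taken over query rows of a row-stochastic attention matrix; the function name `tail_attention_mass` and the toy matrix are hypothetical, and the paper's exact averaging convention is not visible in the excerpt above.

```python
def tail_attention_mass(attn, kept):
    """Mean, over query rows, of the attention mass on discarded key positions.

    attn: list of rows, each a list of attention weights summing to 1.
    kept: indices of key positions retained in the compressed cache.
    """
    kept = set(kept)
    per_row = [
        sum(w for j, w in enumerate(row) if j not in kept)
        for row in attn
    ]
    return sum(per_row) / len(per_row)


# Toy 2x4 attention matrix (hypothetical values).
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
]

# Keep key positions 0 and 1, discard positions 2 and 3.
eps_tail = tail_attention_mass(attn, kept=[0, 1])
```

In this toy case each row loses 0.2 of its attention mass, so the tail mass is about 0.2. Keeping every position gives 0, which matches the intuition that an uncompressed cache has no tail error.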

Figures (6)

  • Figure 1: Overview of the CompilerKV Framework. The framework consists of three integrated stages: (1) computing a noise-resilient baseline score to filter out transient distractions; (2) modulating the ranking via a compiled Head Heterogeneity Table to strictly govern the functional differences among attention heads; and (3) querying a Risk-Gating Table to dynamically calibrate the retention threshold based on the prompt's inherent complexity.
  • Figure 2: Performance vs. KV Cache Size. Comparison of average accuracy on LongBench across different budget constraints. Our method (CompilerKV) degrades most gracefully, maintaining usability even at extreme compression ratios where baselines fail.
  • Figure 3: Needle-in-a-Haystack Pressure Test on Mistral-7B. Visual comparison of retrieval accuracy (Green=100%, Red=0%) across varying context lengths (x-axis) and needle depths (y-axis). FullKV (a) sets the upper bound. While baselines like StreamingLLM (b) and SnapKV (c) struggle with long-range dependencies, and DynamicKV (e) shows fragmentation at extreme lengths, our method CompilerKV (f) maintains a robust retrieval pattern comparable to the oracle.
  • Figure A1: Universality across Model Architectures. The average performance trend of our method on four different LLMs (LLaMA-3, Qwen2, InternLM-2.5, Mistral) under varying KV cache budgets (ranging from 64 to 1024). The consistent degradation patterns across diverse architectures demonstrate the generalization capability of our risk-adaptive mechanism.
  • Figure A3: Risk-Adaptive Threshold Gating Policy. The plots show the learned retention threshold ($\tau$) across layers for samples with varying risk levels (binned by Perplexity and Attention Entropy). For high-risk samples (High Perplexity, rightmost plots), the policy automatically lowers thresholds to preserve more information, validating our risk-adaptive mechanism.
  • ...and 1 more figure
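The pipeline in the Figure 1 caption (per-head scores, head-specific reweighting, then a retention threshold) can be sketched for a single layer as follows. This is an illustrative simplification under stated assumptions, not the authors' algorithm: the function `select_tokens`, the weighted-average aggregation, and the rule "apply the threshold, then cap at the budget" are all hypothetical, and the per-head baseline scores are taken as given rather than computed.

```python
def select_tokens(head_scores, head_weights, threshold, budget):
    """Pick token positions to retain in the KV cache for one layer.

    head_scores[h][t]: importance score of token t according to head h.
    head_weights[h]:   reliability weight of head h (hypothetical table entry).
    threshold:         minimum aggregated utility for a token to be considered.
    budget:            maximum number of tokens to retain.
    """
    num_tokens = len(head_scores[0])
    weight_sum = sum(head_weights)

    # Reliability-weighted average of the per-head scores.
    utility = [
        sum(w * scores[t] for w, scores in zip(head_weights, head_scores)) / weight_sum
        for t in range(num_tokens)
    ]

    # Threshold first, then keep the highest-utility tokens within the budget.
    candidates = [t for t in range(num_tokens) if utility[t] >= threshold]
    candidates.sort(key=lambda t: utility[t], reverse=True)
    return sorted(candidates[:budget])


# Two heads that disagree about which tokens matter (hypothetical scores).
head_scores = [
    [0.6, 0.1, 0.2, 0.1],   # head 0 favors token 0
    [0.1, 0.1, 0.1, 0.7],   # head 1 favors token 3
]

trust_head0 = select_tokens(head_scores, [0.9, 0.1], threshold=0.15, budget=2)
equal_trust = select_tokens(head_scores, [0.5, 0.5], threshold=0.2, budget=2)
```

With head 0 weighted heavily, the sketch keeps tokens 0 and 2; with equal weights it keeps tokens 0 and 3. The example shows how a table of head weights can change which tokens survive under the same budget.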

Theorems & Definitions (2)

  • Theorem 4.1: Stability-Oriented Attention Approximation Bound
  • proof