KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski, Adrian Łańcucki

TL;DR

KVTC is a transform-coding approach for compressing the large KV caches used in LLM inference. It performs a one-time calibration to learn a PCA basis, then allocates bits per component via dynamic programming (DP) and applies lossless entropy coding. This yields up to about 20× compression with negligible accuracy loss, and over 40× in some scenarios. The method leaves model parameters unchanged, supports cache reuse across turns, and complements existing cache-management strategies, offering a practical path to memory-efficient, scalable LLM serving. Across multiple models and benchmarks, KVTC consistently outperforms inference-time baselines such as eviction, low-rank, and simple quantization methods while delivering favorable latency characteristics.

Abstract

Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
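
The three stages named in the abstract (PCA-based decorrelation, quantization with per-component bit widths, and entropy coding) can be illustrated with a small sketch. This is not the authors' implementation: the function names `fit_basis`, `encode`, and `decode`, the uniform quantizer, the hand-picked bit widths, and the use of zlib's DEFLATE as the lossless coder are all assumptions made for illustration, and the data is synthetic rather than a real KV cache.

```python
import zlib
import numpy as np


def fit_basis(calib):
    """One-time calibration: mean and orthonormal PCA basis (columns) of `calib`."""
    mean = calib.mean(axis=0)
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt.T


def encode(x, mean, basis, bits):
    """Project onto the basis, quantize column i with bits[i] bits (0 drops it),
    then pass the integer codes through a lossless coder (DEFLATE, an assumption)."""
    bits = np.asarray(bits, dtype=np.int64)
    keep = bits > 0
    kept = ((x - mean) @ basis)[:, keep]
    scales = np.abs(kept).max(axis=0)
    scales[scales == 0] = 1.0
    levels = 2.0 ** bits[keep]
    # Map [-scale, scale] onto integer bins 0 .. 2^b - 1 (requires b <= 8).
    q = np.floor((kept / scales + 1.0) / 2.0 * levels)
    q = np.clip(q, 0, levels - 1).astype(np.uint8)
    return zlib.compress(q.tobytes(), 9), scales, q.shape


def decode(payload, scales, shape, mean, basis, bits):
    """Invert `encode`: decompress, dequantize to bin centers, rotate back."""
    bits = np.asarray(bits, dtype=np.int64)
    keep = bits > 0
    levels = 2.0 ** bits[keep]
    q = np.frombuffer(zlib.decompress(payload), dtype=np.uint8).reshape(shape)
    coeffs = np.zeros((shape[0], basis.shape[1]))
    coeffs[:, keep] = ((q + 0.5) / levels * 2.0 - 1.0) * scales
    return coeffs @ basis.T + mean


# Synthetic stand-in for cache features: 16 dimensions driven by 4 latent factors.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 16))
calib = rng.normal(size=(512, 4)) @ mixing + 0.01 * rng.normal(size=(512, 16))
cache = rng.normal(size=(256, 4)) @ mixing + 0.01 * rng.normal(size=(256, 16))

mean, basis = fit_basis(calib)
bits = [8, 6, 4, 4] + [0] * 12  # hand-picked widths, purely illustrative
payload, scales, shape = encode(cache, mean, basis, bits)
restored = decode(payload, scales, shape, mean, basis, bits)
rel_err = np.linalg.norm(restored - cache) / np.linalg.norm(cache)
print(f"{cache.nbytes} bytes -> {len(payload)} bytes, relative error {rel_err:.3f}")
```

The basis is fitted once on `calib` and reused for a different input, mirroring the calibrate-once design described above; only the per-column scales and the compressed codes are specific to each cache.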

Paper Structure

This paper contains 64 sections, 10 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: The KVTC transform-coding pipeline. Features are linearly decorrelated via PCA, and the resulting PCA coefficients are quantized using variable bit widths. The PCA basis $V$ is computed once on a calibration dataset and reused for all caches. Key and value caches are compressed separately.
  • Figure 3: Cosine similarity between key (a) and value (b) heads before and after alignment, calculated using Llama 3.1 8B on inputs from Qasper (Dasigi et al., 2021; Shaham et al., 2022). For each example, we calculate cosine similarity between all keys/values from the same position and then average across the batch. Orthonormal alignment matrices were produced using 20 samples from RedPajama v2 (Weber et al., 2024).
  • Figure 4: Ablation of KVTC with compression ratio 64$\times$ on Llama 3.1 8B: (a) compression disabled for attention sink tokens; (b) compression disabled for the final 128 tokens. All other settings are fixed. Additional ablations are provided in the appendix.
  • Figure 5: A high-level architecture of KV-cache-aware LLM serving environment.
  • Figure 6: Calibration of Llama 3.1 8B with KVTC. Left: Reconstruction error as a function of the calibration set size. The arrow $A\rightarrow B$ denotes fitting PCA on dataset A and calculating the error on dataset B. Middle: Reconstruction error as a function of position in the context. The error is higher for the sink tokens. Right: Bit assignment computed via dynamic programming, including the per-group scaling factors.
  • ...and 4 more figures
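
The Figure 6 caption mentions bit assignment computed via dynamic programming. As a rough illustration of that kind of allocation, the sketch below solves a toy version: it assumes the textbook distortion model $D_i(b) = \sigma_i^2 \cdot 4^{-b}$ for a component with variance $\sigma_i^2$ quantized at $b$ bits, and minimizes the total distortion under a fixed bit budget. The function name `allocate_bits`, the distortion model, and the omission of grouping and per-group scaling-factor overhead are assumptions; the paper's actual objective may differ.

```python
def allocate_bits(variances, budget, max_bits=8):
    """Toy DP: choose integer bit widths (0..max_bits) per component that
    minimize sum(var * 4**-bits) while using at most `budget` bits in total."""
    inf = float("inf")
    # best[j]: minimal distortion of the components seen so far with at most j bits.
    best = [0.0] * (budget + 1)
    choices = []
    for var in variances:
        new_best = [inf] * (budget + 1)
        pick = [0] * (budget + 1)
        for used in range(budget + 1):
            for b in range(min(max_bits, used) + 1):
                cost = best[used - b] + var * 4.0 ** (-b)
                if cost < new_best[used]:
                    new_best[used] = cost
                    pick[used] = b
        best = new_best
        choices.append(pick)
    # Backtrack from the full budget to recover the per-component widths.
    bits = [0] * len(variances)
    remaining = budget
    for i in range(len(variances) - 1, -1, -1):
        bits[i] = choices[i][remaining]
        remaining -= bits[i]
    return bits, best[budget]


# High-variance components receive more bits; a zero-variance one receives none.
bits, distortion = allocate_bits([16.0, 4.0, 1.0, 0.0], budget=6)
print(bits, distortion)  # [3, 2, 1, 0] 0.75
```

In this toy setting each retained component ends up with the same residual distortion (0.25), which is the familiar behavior of variance-aware allocation: bits go where they reduce error the most.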