KV-Distill: Nearly Lossless Learnable Context Compression for LLMs
Vivek Chari, Guanghui Qin, Benjamin Van Durme
TL;DR
KV-Distill tackles the memory bottleneck of long-context conditioning in LLMs by learning a general, question-independent compression of the KV cache. It selects important tokens and routes them through LoRA adapters to produce a compact yet expressive condensed cache, trained via a forward–reverse KL distillation objective that aligns next-token distributions with those of the uncompressed model. The method achieves substantial memory reductions (up to 1000x in principle) with minimal downstream performance loss across extractive and abstractive tasks and model scales, and can be fine-tuned on domain-specific contexts for ultra-high compression. The approach offers practical, generalizable gains for memory-efficient generation in long-context applications, and distilled checkpoints are released to facilitate adoption.
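As a rough illustration of the token-selection step described above, the sketch below keeps the highest-scoring positions of a toy KV cache and reports the resulting compression ratio. The importance scores, the function name `select_kv`, and the list-based cache layout are assumptions made for illustration only; in KV-Distill the selection is learned, and the retained tokens are further processed by LoRA adapters, which this sketch omits.

```python
def select_kv(keys, values, scores, keep):
    """Keep the `keep` highest-scoring positions of a toy KV cache, preserving token order."""
    if not (len(keys) == len(values) == len(scores)):
        raise ValueError("keys, values, and scores must have the same length")
    keep = max(0, min(keep, len(scores)))
    # Rank positions by score (descending), then restore the original token order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache: 8 positions with 2-dimensional key/value vectors (hypothetical data).
keys = [[float(i), 0.0] for i in range(8)]
values = [[0.0, float(i)] for i in range(8)]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]  # hypothetical importance scores

small_keys, small_values, kept = select_kv(keys, values, scores, keep=2)
compression_ratio = len(keys) / len(small_keys)
```

Here the cache shrinks from 8 positions to 2, so subsequent generation would attend over a quarter of the original entries. Restoring the original order after ranking keeps the retained positions aligned with their place in the context.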
Abstract
Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations, stored in the so-called KV cache, account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adapter for pretrained models, and enables the compression of arbitrary spans of a context while preserving pretrained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
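The student-teacher objective can be sketched as follows. This is a minimal sketch, assuming the KL-type divergence is a weighted sum of forward KL (teacher to student) and reverse KL (student to teacher) over next-token distributions; the weight `lam`, the function names, and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(teacher_logits, student_logits, lam=0.5):
    """Weighted forward/reverse KL; `lam` is an assumed mixing weight."""
    p = softmax(teacher_logits)  # teacher: conditioned on the uncompressed cache
    q = softmax(student_logits)  # student: conditioned on the compressed cache
    return lam * kl(p, q) + (1.0 - lam) * kl(q, p)


# Toy next-token logits over a 3-word vocabulary (hypothetical data).
teacher = [2.0, 1.0, 0.1]
close_student = [1.9, 1.1, 0.1]
far_student = [0.1, 1.0, 2.0]

loss_close = distill_loss(teacher, close_student)
loss_far = distill_loss(teacher, far_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so `loss_close` is smaller than `loss_far`. In practice such a loss would be averaged over output positions and minimized with respect to the adapter parameters.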
