KV-Distill: Nearly Lossless Learnable Context Compression for LLMs

Vivek Chari, Guanghui Qin, Benjamin Van Durme

TL;DR

KV-Distill tackles the memory bottleneck of long-context conditioning in LLMs by learning a general, question-independent compression of the KV cache. It selects important tokens and routes them through LoRA-adapted attention matrices to produce a compact yet expressive cache, trained with a forward–reverse KL distillation objective that aligns the compressed model's next-token distributions with those of the uncompressed model. The method achieves substantial memory reductions (up to 1000x in principle) with minimal downstream performance loss across extractive and abstractive tasks and across model scales, and it can be fine-tuned on domain-specific contexts for very high compression. Distilled checkpoints are released to facilitate adoption in memory-efficient long-context generation.
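To make the forward–reverse KL objective concrete, here is a minimal sketch for a single next-token position. It is an illustration, not the authors' implementation: the function names and the mixing weight `lam` are assumptions introduced here, and the exact weighting used in the paper is not given in this summary. The teacher is the model with the full cache; the student is the same model conditioned on the compressed cache.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def forward_reverse_kl(teacher_logits, student_logits, lam=0.5):
    """Weighted sum of forward KL(teacher || student) and
    reverse KL(student || teacher).

    `lam` is a hypothetical mixing weight chosen for illustration.
    """
    p = softmax(teacher_logits)  # uncompressed-cache distribution
    q = softmax(student_logits)  # compressed-cache distribution
    return lam * kl_divergence(p, q) + (1.0 - lam) * kl_divergence(q, p)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.5, 0.7, -0.5]
    print(forward_reverse_kl(teacher, student))
```

The loss is zero when the two distributions match and grows as the compressed model's predictions drift from the uncompressed ones. In training, it would be averaged over all predicted positions.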

Abstract

Sequence-to-sequence tasks often benefit from long contexts, but the quadratic complexity of self-attention in standard Transformers renders this non-trivial. During generation, temporary representations (stored in the so-called KV cache) account for a large portion of GPU memory usage and scale linearly with context length. We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations in a question-independent fashion. KV-Distill can be trained as a parameter-efficient adapter for pretrained models, and enables the compression of arbitrary spans of a context while preserving pretrained model capabilities. We treat a compressed-uncompressed cache as a student-teacher pairing and apply a KL-type divergence to match the generated outputs. KV-Distill outperforms other compression techniques in worst-case extractive tasks and approaches uncompressed performance in long-context question answering and summarization, and it can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.
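The linear scaling of KV cache memory with context length can be checked with a short calculation. The sketch below uses the standard accounting (one key and one value vector per token, per layer, per KV head). The model dimensions are hypothetical values chosen for illustration, not figures from the paper.

```python
import math


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for `seq_len` tokens.

    The leading factor of 2 covers keys plus values; the default of
    2 bytes per value corresponds to 16-bit floats.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value)


def retained_tokens(seq_len, keep_ratio):
    """Tokens kept after compression, never fewer than one."""
    return max(1, math.ceil(seq_len * keep_ratio))


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

    full = kv_cache_bytes(seq_len=8192, **shape)
    kept = kv_cache_bytes(seq_len=retained_tokens(8192, 0.01), **shape)

    print(f"full cache:       {full / 2**20:.1f} MiB")
    print(f"compressed cache: {kept / 2**20:.1f} MiB")
    print(f"reduction:        {full / kept:.1f}x")
```

Under these assumed dimensions, an 8192-token context occupies 1 GiB of cache. Keeping 1% of the tokens (a 99% length reduction) shrinks it roughly a hundredfold, since the cost is proportional to the number of retained tokens.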

Paper Structure

This paper contains 22 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: We select a subset of tokens from the KV cache and distill the full cache into that smaller subset.
  • Figure 2: Selected tokens are routed to trainable, LoRA-adapted $\vec{W}^\text{Q}$ and $\vec{W}^\text{O}$ matrices ($\vec{W}^\text{O}$ is omitted in this figure); all other tokens pass through the original (frozen) model parameters.
  • Figure 3: Needle-in-a-Haystack results. The $x$-axis shows the length of the document, the $y$-axis the compression ratio applied, and the color the retrieval accuracy under those settings, averaged across different locations in the document. Left: $\mathsf{H_2I}$. Right: KV-Distill.
  • Figure 4: QuALITY accuracy against compression.
  • Figure 5: ROUGE-L on SQuALITY.
  • ...and 3 more figures