Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh

TL;DR

The paper addresses the memory bottleneck of Key-Value (KV) caches in long-context LLMs by introducing AQUA-KV, an adaptive KV-cache quantization framework. Compact predictors exploit inter-layer and intra-layer dependencies between keys and values, and only the residual that the predictors cannot explain is quantized. Calibration is one-shot: predictors are trained sequentially on reconstructed past caches, and the method is agnostic to the backbone quantizer. Empirically, AQUA-KV delivers substantial memory reductions with minimal accuracy loss across Llama 3.x and Qwen 2.5 models, achieving near-lossless inference at 2-2.5 bits per value, with under 1% relative error in perplexity and LongBench scores on Llama 3.2 models. The method remains compatible with pruning techniques and can be calibrated on a single GPU within 1-6 hours, even for 70B models. It offers a practical route to memory-efficient LLM inference and may also give insight into redundancy in attention.
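The predict-then-quantize-residual idea can be illustrated with a small sketch. This is not the authors' implementation: the data are synthetic, the predictor is an ordinary least-squares linear map, and `quantize_uniform` is a hypothetical min-max quantizer standing in for whatever backbone quantizer is actually used.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Hypothetical stand-in quantizer: per-row min-max uniform quantization.
    Returns the dequantized array (what the model would read back)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

rng = np.random.default_rng(0)
n_tokens, dim = 512, 64

# Synthetic stand-ins for the keys of two adjacent layers (assumption):
# the current layer is mostly a linear function of the previous one.
prev_keys = rng.normal(size=(n_tokens, dim))
mixing = rng.normal(size=(dim, dim)) / np.sqrt(dim)
curr_keys = prev_keys @ mixing + 0.1 * rng.normal(size=(n_tokens, dim))

# Calibration: fit the predictor on the *reconstructed* previous cache,
# since that is all that will be available at inference time.
prev_hat = quantize_uniform(prev_keys, bits=4)
weights, *_ = np.linalg.lstsq(prev_hat, curr_keys, rcond=None)

# Inference: predict, then quantize and store only the residual.
predicted = prev_hat @ weights
residual_hat = quantize_uniform(curr_keys - predicted, bits=2)
curr_hat = predicted + residual_hat

# Baseline: quantize the current keys directly at the same bit width.
direct_hat = quantize_uniform(curr_keys, bits=2)
err_residual = np.mean((curr_keys - curr_hat) ** 2)
err_direct = np.mean((curr_keys - direct_hat) ** 2)
print(f"residual MSE: {err_residual:.4f}, direct MSE: {err_direct:.4f}")
```

In this toy setting the residual has a much smaller range than the raw keys, so the same 2-bit budget yields a lower reconstruction error. How much real KV caches benefit depends on how predictable they are from earlier layers, which is what the paper measures.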

Abstract

Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality at higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization method for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
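To make the "tens of gigabytes" figure concrete, here is a back-of-the-envelope calculation. The model shape below (80 layers, 8 KV heads, head dimension 128, 131072 tokens) is an assumption chosen to resemble a 70B-class model with grouped-query attention; it is not taken from the text above, and the estimate ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value vector.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return n_values * bits_per_value / 8

# Hypothetical configuration (assumption, see lead-in).
config = dict(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=131072)

fp16_gib = kv_cache_bytes(**config, bits_per_value=16) / 2**30
q2_gib = kv_cache_bytes(**config, bits_per_value=2) / 2**30
print(f"16-bit: {fp16_gib:.1f} GiB, 2-bit: {q2_gib:.1f} GiB")
# prints "16-bit: 40.0 GiB, 2-bit: 5.0 GiB"
```

Under these assumptions a single long sequence needs about 40 GiB at 16 bits per value, and about 5 GiB at 2 bits per value before any per-group overhead.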

Paper Structure

This paper contains 24 sections, 6 figures, 15 tables, 4 algorithms.

Figures (6)

  • Figure 1: Comparison of AQUA-KV to alternative Key-Value Cache compression methods for Llama 3.x models in terms of average LongBench score on 14 English tasks (see the experiments section).
  • Figure 2: Mean Explained Variance Ratios by linear probes from previous blocks (L), tokens (T) and role on Llama-3.2-3B.
  • Figure 3: An intuitive scheme of the AQUA-KV inference. Only the quantized residuals are saved for each block.
  • Figure 4: Additional Mean Explained Variance Ratios by linear probes from previous blocks (L), tokens (T) and role on Llama-3.2-3B.
  • Figure 5: Explained Variance Ratios per Transformer Block for chosen sets of linear probes on Llama-3.2-3B.
  • ...and 1 more figure