KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

TL;DR

KV cache memory is a substantial bottleneck for large language model inference. The authors introduce Coupled Quantization (CQ), a KV cache quantization method that jointly encodes groups of key/value channels to exploit inter-channel dependencies, aided by Fisher-guided centroid learning. The approach preserves model quality at low bit widths, including 1 bit per channel, and is competitive with or better than existing baselines across multiple models and benchmarks. These results suggest that CQ enables memory-efficient inference for large autoregressive models in practical deployment scenarios.

Abstract

Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
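The entropy observation above can be illustrated on synthetic data. The following is a minimal sketch, not the paper's estimator: it assumes two correlated Gaussian "channels" (correlation 0.9, chosen arbitrarily) in place of real key/value activations, and uses a simple equal-width binning estimate of entropy. The helper names `binned_entropy`, `make_correlated_channels`, and `discretize` are hypothetical.

```python
import numpy as np


def binned_entropy(codes: np.ndarray) -> float:
    """Empirical entropy (in bits) of the rows of an integer code matrix."""
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def make_correlated_channels(n: int = 20000, rho: float = 0.9, seed: int = 0) -> np.ndarray:
    """Synthetic stand-in for two dependent activation channels (an assumption)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(np.zeros(2), cov, size=n)


def discretize(x: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Map each channel to integer bin indices using equal-width bins."""
    codes = np.empty(x.shape, dtype=np.int64)
    for c in range(x.shape[1]):
        edges = np.linspace(x[:, c].min(), x[:, c].max(), n_bins + 1)
        # Interior edges only, so indices fall in [0, n_bins - 1].
        codes[:, c] = np.digitize(x[:, c], edges[1:-1])
    return codes


x = make_correlated_channels()
codes = discretize(x)

# Sum of per-channel (marginal) entropies versus entropy of the channel pair.
sum_marginal = sum(binned_entropy(codes[:, [c]]) for c in range(codes.shape[1]))
joint = binned_entropy(codes)

print(f"sum of marginal entropies: {sum_marginal:.3f} bits")
print(f"joint entropy:             {joint:.3f} bits")
```

On dependent channels the joint entropy comes out below the sum of the marginals, which is the gap that coupling channels is meant to exploit. With independent channels (`rho=0.0`) the two quantities would be nearly equal.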

Paper Structure (16 sections, 6 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Growth rate of joint entropy versus sum of marginal entropies of the LLaMA-7b key/value activation embeddings on 262k tokens of WikiText-2. Entropy is estimated using Equation \ref{eq:entropy}. The slower growth rate of joint entropy implies that jointly quantizing more channels requires fewer bits than quantizing each channel independently.
  • Figure 2: Correlation matrices of the first 32 channels of 8 layers of LLaMA-7b key and value activation embeddings on WikiText-2. Channel pairs exhibit high levels of linear dependency, shown by high magnitudes of the correlation coefficients.
  • Figure 3: A comparison of 1-bit channel-wise quantization and our proposed Coupled Quantization (using 2 bits per 2 channels as an example). The quantization results on the first two channels of the first-layer key activation embeddings of LLaMA-7b on the WikiText-2 dataset are shown. Channel-wise quantization is ineffective at capturing the original values at low bit widths, while CQ leverages the dependency between channels to achieve low quantization errors.
  • Figure 4: Perplexity and key/value quantization errors (averaged over all layers) of LLaMA-7b on WikiText-2. Channel coupling and Fisher-guided centroid learning are effective for improving perplexity.
  • Figure 5: Correlation matrix for the first 32 channels of pre-RoPE key activation embeddings of all LLaMA-7b layers on WikiText-2.
  • ...and 3 more figures
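
The comparison described in the Figure 3 caption (1 bit per channel versus 2 bits per 2 coupled channels) can be sketched on synthetic data. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data are two correlated Gaussian channels rather than LLaMA-7b activations, the channel-wise baseline uses two learned levels per channel, and the centroids are fit with plain unweighted k-means, so the Fisher-guided weighting is omitted. The helpers `lloyd` and `mse` are hypothetical names.

```python
import numpy as np


def lloyd(x: np.ndarray, centroids: np.ndarray, n_iter: int = 25):
    """Plain (unweighted) k-means refinement from the given initial centroids."""
    centroids = centroids.astype(np.float64).copy()
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for k in range(len(centroids)):
            members = x[assign == k]
            if len(members):  # leave empty clusters where they are
                centroids[k] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)


def mse(x: np.ndarray, centroids: np.ndarray, assign: np.ndarray) -> float:
    """Mean squared quantization error per scalar value."""
    return float(((x - centroids[assign]) ** 2).mean())


# Synthetic stand-in for two dependent channels (an assumption).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
x = rng.multivariate_normal(np.zeros(2), cov, size=20000)

# Channel-wise 1-bit: two levels per channel, fit independently.
levels = []
for c in range(2):
    col = x[:, [c]]
    init = np.quantile(col, [0.25, 0.75]).reshape(2, 1)
    cents, _ = lloyd(col, init)
    levels.append(cents[:, 0])

# Independent per-channel codes are equivalent to a 4-point product codebook.
product = np.array([[a, b] for a in levels[0] for b in levels[1]])
dist = ((x[:, None, :] - product[None, :, :]) ** 2).sum(-1)
cw_mse = mse(x, product, dist.argmin(1))

# Coupled: 4 free centroids in 2-D (2 bits per 2 channels, same storage cost),
# initialized from the product codebook and then refined jointly.
cq_centroids, cq_assign = lloyd(x, product)
cq_mse = mse(x, cq_centroids, cq_assign)

print(f"channel-wise 1-bit MSE:  {cw_mse:.4f}")
print(f"coupled 2-bit/2-ch MSE: {cq_mse:.4f}")
```

Both schemes store 2 bits per pair of values. Initializing the coupled codebook from the channel-wise product codebook means each k-means step can only lower the error from that starting point, so the coupled error is never worse in this sketch; how much lower it gets depends on how strongly the channels are correlated.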