GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao

TL;DR

GEAR tackles the KV cache memory bottleneck in generative LLM inference by fusing a quantized backbone with a low-rank residual and a sparse outlier correction, complemented by a streaming buffer and GPU-optimized kernel. This three-way decomposition reduces approximation error at high compression (2-bit) while preserving near-lossless accuracy across challenging CoT tasks and long-context scenarios. Empirically, GEAR delivers up to ~2.4x peak memory reduction and up to ~5x throughput gains versus FP16 KV caches, outperforming state-of-the-art baselines by substantial margins on diverse models and datasets. The approach is plug-and-play with existing quantization schemes and offers a scalable route to memory-efficient, high-throughput generative inference for large language models.

Abstract

Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, cache demand grows with sequence length, which has turned LLM inference into a memory-bound problem and significantly constrains system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation error when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviation in model generation and degraded performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, which have similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy individual errors from outlier entries. By integrating the three techniques, GEAR is able to fully exploit their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory size by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
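
The three-part decomposition described in the abstract (quantized backbone, low-rank approximation of the quantization error, sparse correction for outliers) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the per-tensor min-max quantizer, the outlier fraction, the rank, and the synthetic test matrix are all assumptions chosen for illustration, and the sketch ignores the streaming buffer and GPU kernel.

```python
import numpy as np


def uniform_quantize(x, bits):
    """Per-tensor min-max uniform quantization, returned in dequantized form.

    Assumption: a simple stand-in for whatever quantizer is used in practice.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo


def gear_like_compress(x, bits=2, outlier_frac=0.02, rank=4):
    """Hypothetical helper: split x into backbone + low_rank + sparse."""
    # 1) Sparse part: extract the k largest and k smallest entries.
    k = max(1, int(x.size * outlier_frac / 2))
    flat = x.ravel()
    order = np.argsort(flat)
    idx = np.concatenate([order[:k], order[-k:]])
    sparse_flat = np.zeros_like(flat)
    sparse_flat[idx] = flat[idx]
    sparse = sparse_flat.reshape(x.shape)

    # 2) Quantized backbone: quantize what remains once outliers are removed.
    backbone = uniform_quantize(x - sparse, bits)

    # 3) Low-rank part: truncated SVD of the remaining quantization error.
    residual = x - sparse - backbone
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return backbone, low_rank, sparse


def rel_error(x, approx):
    """Relative Frobenius-norm approximation error."""
    return np.linalg.norm(x - approx) / np.linalg.norm(x)


# Synthetic stand-in for one KV cache matrix: Gaussian entries plus outliers.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
rows = rng.integers(0, 64, size=10)
cols = rng.integers(0, 32, size=10)
x[rows[:5], cols[:5]] = 15.0
x[rows[5:], cols[5:]] = -15.0

backbone, low_rank, sparse = gear_like_compress(x)
print("quantization only      :", rel_error(x, uniform_quantize(x, 2)))
print("backbone + sparse      :", rel_error(x, backbone + sparse))
print("backbone + sparse + LR :", rel_error(x, backbone + sparse + low_rank))
```

In this toy setting, removing the outliers first narrows the range the 2-bit quantizer has to cover, and the truncated SVD cannot increase the Frobenius error of the residual, so each added component should lower the reported error. How large the gains are on real KV caches depends on the model and data.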

Paper Structure (23 sections, 6 equations, 7 figures, 10 tables, 2 algorithms)


Figures (7)

  • Figure 1: (a) compares the approximation error when compressing KV caches to 2-bit for LLaMA3-8B on GSM8k (w. CoT). (b) presents the difference in prediction logits from the FP16 baseline after compressing the KV caches of a GSM8k (w. CoT) example, indicating that the approximation error can be severely compounded across decoding steps and critically divert model generation. (c) shows that reducing the error can significantly improve performance.
  • Figure 2: (a, b) We randomly sample a GSM8k example and analyze its KV caches from LLaMA2-7B. (a): the minimal approximation error of each individual technique when approximating the Value cache of the first layer; (b): the spectrum of the residual $\bm{R}_{h}$ decays rapidly. (c): As an efficient error-reduction framework, GEAR is orthogonal to any off-the-shelf quantization scheme and can augment it to achieve near-lossless accuracy.
  • Figure 3: (a): wall-clock time percentage of each component in GEAR; the sparse and low-rank components induce negligible overhead. (b): GEAR significantly reduces peak memory, enabling a much larger batch size than FP16. (c): GEAR improves throughput significantly over FP16 due to the introduced techniques.
  • Figure 4: Analysis and ablation study with LLaMA3-8B on GSM8k-CoT under 2-bit compression.
  • Figure 5: Peak memory and throughput comparison with LLaMA2-7B on an RTX Titan 24GB GPU.
  • ...and 2 more figures