KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

Fei Li, Song Liu, Weiguo Wu, Shiqiang Nie, Jinyu Wang

TL;DR

KVmix tackles the KV Cache memory bottleneck in LLM inference with gradient-guided, layer-wise mixed-precision quantization, complemented by a dynamic long-context optimization (RPC). The method derives per-layer importance scores $s_{k_i}$ and $s_{v_i}$ for Keys and Values from gradients, assigns higher bit-widths to the important layers and aggressively low bit-widths to the rest, and keeps recent pivotal KV pairs in full precision to preserve generation quality. It combines asymmetric per-channel Key quantization with per-token Value quantization, uses group-wise packing for 1–4 bit configurations, and implements fused CUDA kernels to minimize overhead. Empirical results across Llama, Mistral, and Falcon models show memory reductions of $\approx$4.9× and throughput gains of up to $\approx$5.3×, with near-baseline accuracy on long-context tasks and competitive accuracy on the GSM8K and Wikitext-2 benchmarks, which makes the method relevant for resource-constrained deployments.
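The difference between per-channel Key quantization and per-token Value quantization can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `asym_quantize` and `dequantize`, the tensor shapes, and the random inputs are assumptions made for illustration, and the group-wise packing and fused CUDA kernels mentioned above are omitted.

```python
import numpy as np

def asym_quantize(x: np.ndarray, bits: int, axis: int):
    """Asymmetric uniform quantization; min/max statistics are reduced over `axis`."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)          # zero-point (minimum of each slice)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard against constant slices
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Map integer codes back to approximate floating-point values."""
    return q.astype(np.float32) * scale + lo

# Hypothetical cache slices of shape (tokens, channels); values are random.
rng = np.random.default_rng(0)
tokens, channels = 64, 16
K = rng.standard_normal((tokens, channels)).astype(np.float32)
V = rng.standard_normal((tokens, channels)).astype(np.float32)

# Keys, per-channel: one (scale, zero-point) per channel, reduced across tokens.
qK, sK, zK = asym_quantize(K, bits=2, axis=0)
# Values, per-token: one (scale, zero-point) per token, reduced across channels.
qV, sV, zV = asym_quantize(V, bits=2, axis=1)

K_hat = dequantize(qK, sK, zK)
V_hat = dequantize(qV, sV, zV)
```

In this sketch the only difference between the two schemes is the reduction axis: `sK` has one entry per channel and `sV` one entry per token, and the reconstruction error of each element is bounded by half of its slice's scale.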

Abstract

The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment on resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by the KV Cache. However, existing methods either rely on static, one-size-fits-all precision allocation or fail to dynamically prioritize critical KV pairs in long-context tasks, forcing tradeoffs among memory, accuracy, and throughput. In this work, we propose a novel mixed-precision quantization method for the KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It assigns higher precision to important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, achieving high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to reduce computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with an extremely low-bit quantization configuration (Key 2.19 bits, Value 2.38 bits), while delivering 4.9× memory compression and a 5.3× speedup in inference throughput.
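Fractional averages such as "Key 2.19 bits" arise naturally when different layers receive different integer bit-widths. The following Python sketch shows one way such a layer-wise allocation could work. It is an illustration under stated assumptions, not the paper's procedure: the helper `allocate_bits`, the "top fraction" rule, and the importance scores are all hypothetical, and the gradient computation that would produce real scores is not shown.

```python
import numpy as np

def allocate_bits(scores, top_fraction, high_bits, low_bits):
    """Give `high_bits` to the top-scoring fraction of layers and `low_bits` to the rest."""
    scores = np.asarray(scores, dtype=np.float64)
    n_high = int(round(top_fraction * len(scores)))
    order = np.argsort(-scores, kind="stable")    # most important layer first
    bits = np.full(len(scores), low_bits, dtype=np.int64)
    bits[order[:n_high]] = high_bits
    return bits

# Made-up Key importance scores for a hypothetical 8-layer model.
s_k = [0.9, 0.1, 0.3, 0.05, 0.7, 0.2, 0.15, 0.4]

key_bits = allocate_bits(s_k, top_fraction=0.25, high_bits=3, low_bits=2)
avg_key_bits = key_bits.mean()

print(key_bits)       # [3 2 2 2 3 2 2 2]
print(avg_key_bits)   # 2.25
```

Here the two highest-scoring layers (indices 0 and 4) receive 3 bits and the remaining six receive 2 bits, so the average Key bit-width is 18 / 8 = 2.25. Raising `top_fraction` trades memory for accuracy, which is the tunable balance the abstract describes.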

Paper Structure

This paper contains 29 sections, 8 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Accuracy of the Llama 2-7B model [touvron2023llama] on the GSM8K [cobbe2021training] and TruthfulQA [lin-etal-2022-truthfulqa] datasets, evaluated with lm_eval [eval-harness]. FP16 denotes no quantization; 0-3 denotes 2-bit quantization applied individually to the Key or Value of layers 0 through 3, and so on for the other layer ranges.
  • Figure 2: Projection matrix weights of K and V across different layers of the Llama 2-7B model. "Norm" is the L2 norm of each layer's weight matrix, and "Range" is the range of values within each layer's weight matrix.
  • Figure 3: The overview of KVmix profiler.
  • Figure 4: Dynamic adjustment of quantized KV Cache based on RPC during prefill and decoding phases.
  • Figure 5: Performance variation of Llama 2-7B on GSM8K with different quantization configurations ("10%" indicates that the top 10% most important layers are quantized to 4 and 3 bits, and the remaining layers are quantized to 2 bits).
  • ...and 8 more figures