Quantize What Counts: More for Keys, Less for Values

Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

TL;DR

This work addresses inference-time memory bottlenecks from KV-cache in large language models by grounding bit allocation in Transformer geometry. It proves two theorems—Key-Value Norm Disparity and Key-Prioritized Quantization—that show key projections carry higher information density and that allocating more bits to keys strictly reduces quantization error. Extensive experiments across diverse models, benchmarks, and backends demonstrate that a key-first configuration such as $K_4V_2$ can retain up to 98.3% of $K_4V_4$ accuracy while achieving roughly 25% KV-cache memory savings, and that these gains complement rotation-based outlier redistribution. The results establish a geometry-driven design principle for KV quantization that enhances practical efficiency and can be combined with existing KV methods to enable scalable, resource-conscious LLM deployment.
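The roughly 25% figure follows from simple bit counting. As a minimal sketch, assuming that only the payload bits per cached element matter and that quantization metadata (scales, zero points) is ignored, the saving of $K_4V_2$ relative to $K_4V_4$ can be checked as follows. The function names are illustrative and do not come from the paper's code.

```python
def kv_bits_per_pair(key_bits, value_bits):
    """Payload bits for one key element plus one value element.

    Assumption: scale/zero-point metadata is ignored, so this is a
    lower bound on the real cache footprint.
    """
    return key_bits + value_bits


def relative_savings(baseline, candidate):
    """Fraction of KV-cache payload bits saved by `candidate` vs `baseline`.

    Both arguments are (key_bits, value_bits) tuples.
    """
    base = kv_bits_per_pair(*baseline)
    cand = kv_bits_per_pair(*candidate)
    return 1.0 - cand / base


# K4V2 (4-bit keys, 2-bit values) against uniform K4V4.
savings_k4v2 = relative_savings((4, 4), (4, 2))
print(f"K4V2 vs K4V4: {savings_k4v2:.0%} fewer KV-cache payload bits")
```

Under the same accounting, $K_2V_4$ costs exactly as many bits as $K_4V_2$, so the choice between the two is about accuracy, not memory.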

Abstract

Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
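The two norms named in the abstract are straightforward to measure. The sketch below is only an illustration of how one might compare them with NumPy: the matrices are synthetic stand-ins, and the key matrix is scaled up by hand (an assumption made purely for the demo) rather than taken from a real model.

```python
import numpy as np


def spectral_norm(w):
    """Largest singular value of a 2-D matrix."""
    return float(np.linalg.norm(w, ord=2))


def frobenius_norm(w):
    """Square root of the sum of squared entries."""
    return float(np.linalg.norm(w, ord="fro"))


# Synthetic stand-ins for one layer's key and value projection weights.
# Assumption: the 1.5x scale on w_k is hand-picked to mimic the reported
# disparity; it is not measured from any model.
rng = np.random.default_rng(0)
w_k = 1.5 * rng.standard_normal((64, 256))
w_v = rng.standard_normal((64, 256))

for name, w in (("W^K", w_k), ("W^V", w_v)):
    print(f"{name}: spectral={spectral_norm(w):.2f}, "
          f"frobenius={frobenius_norm(w):.2f}")
```

On real checkpoints the same two functions would be applied layer by layer to the key and value projection weights (or to cached activations) to produce per-layer curves like those described in the figure captions.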

Paper Structure

This paper contains 32 sections, 60 equations, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Key cache needs more bits. (Left): Spectral norms of the key cache (blue) and value cache (orange) across layers in Llama 3.3-70B show that key caches consistently exhibit higher norms. (Right): GSM8K accuracy for two schemes, $\text{K}_{2}\text{V}_{4}$ (2-bit K cache, 4-bit V cache) and $\text{K}_{4}\text{V}_{2}$ (4-bit K cache, 2-bit V cache), shows that allocating more bits to the key cache maintains strong performance, confirming the efficacy of norm-aware, mixed-precision quantization.
  • Figure 2: Frobenius norms of key and value weight matrices across the Llama 3 family. $\|W^K\|_F$ exceeds $\|W^V\|_F$ across nearly all layers, except in the early layers of the 70B variant.
  • Figure 3: Singular value spectra of key and value activations in Llama 3.3-70B on the C4 dataset. The x-axis shows singular value indices, ordered from the 5th largest onward for cleaner illustration, and the y-axis shows their magnitudes. Shaded regions mark the minimum-maximum range across attention heads within each layer, while dashed lines indicate the mean at each index. Beyond the top singular value (i.e., the spectral norm), key activations consistently exhibit larger singular values than value activations across the spectrum, highlighting their greater representational capacity. Full spectra are provided in the appendix.
  • Figure 4: Integration of rotation and mixed-precision quantization. Downstream accuracy is shown for four quantization configurations ($\mathrm{K}_2\mathrm{V}_2$, $\mathrm{K}_2\mathrm{V}_4$, $\mathrm{K}_4\mathrm{V}_2$, $\mathrm{K}_4\mathrm{V}_4$) combined with four rotation strategies (none, key-only, value-only, both), using a fixed group size of 64 for both keys and values. Results are reported on CoQA, GSM8K, EQ-Bench, and LongBench, enabling a controlled comparison of precision-rotation interactions.
  • Figure 5: Frobenius norm plot for the Mistral family. The x-axis represents the layer index in the model, while the y-axis represents the Frobenius norm magnitude. The Frobenius norms are higher for the key weights than for the value weights across layers.
  • ...and 6 more figures
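
The mixed-precision configurations in the captions above can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's implementation: it assumes asymmetric min-max uniform quantization over groups of 64 values, uses synthetic Gaussian tensors in which keys are given a 3x larger scale by hand, and measures plain squared reconstruction error instead of downstream accuracy.

```python
import numpy as np


def quantize_dequantize(x, bits, group_size=64):
    """Round-trip `x` through group-wise asymmetric min-max quantization.

    Assumption: x.size is divisible by group_size.
    """
    flat = x.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    # Guard against constant groups, where hi == lo.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((flat - lo) / scale), 0, levels)
    return (codes * scale + lo).reshape(x.shape)


def sq_error(x, bits):
    """Sum of squared reconstruction errors at the given bit width."""
    return float(np.sum((x - quantize_dequantize(x, bits)) ** 2))


# Synthetic caches; the 3x key scale is a hand-picked assumption.
rng = np.random.default_rng(0)
keys = 3.0 * rng.standard_normal((128, 64))
values = rng.standard_normal((128, 64))

# Same total bit budget, opposite allocations.
err_k4v2 = sq_error(keys, 4) + sq_error(values, 2)
err_k2v4 = sq_error(keys, 2) + sq_error(values, 4)
print(f"K4V2 total error: {err_k4v2:.1f}")
print(f"K2V4 total error: {err_k2v4:.1f}")
```

In this toy setting the tensor with the larger dynamic range loses more from a coarse grid, so spending the extra bits on it lowers the combined error. This only mirrors the intuition behind key-prioritized allocation; the formal statement and the accuracy results are those of the paper itself.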

Theorems & Definitions (2)

  • proof
  • proof