Quantize What Counts: More for Keys, Less for Values
Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
TL;DR
This work addresses the inference-time memory bottleneck of the KV cache in large language models by grounding bit allocation in Transformer geometry. It proves two theorems, Key-Value Norm Disparity and Key-Prioritized Quantization, which show that key projections carry higher information density than value projections and that, under a fixed memory budget, allocating more bits to keys strictly reduces quantization error. Experiments across diverse models, benchmarks, and backends show that a key-first configuration such as $K_4V_2$ can retain up to 98.3% of $K_4V_4$ accuracy while saving roughly 25% of KV-cache memory, and that these gains complement rotation-based outlier redistribution. The results establish a geometry-driven design principle for KV quantization that can be combined with existing KV methods to enable scalable, resource-conscious LLM deployment.
Abstract
Large Language Models (LLMs) suffer inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value projections, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% of the accuracy of uniform allocations (e.g., 4 bits for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
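The memory and error trade-off described above can be sketched in a few lines of Python. This is a toy illustration only, not the paper's proof, error metric, or released code: the helper names (`kv_bits_per_token`, `memory_saving`, `quantize`, `sq_error`), the head dimension of 128, the Gaussian synthetic data, and the 3x larger scale assumed for keys are all hypothetical choices made for the example. It also ignores the storage cost of quantization scales and zero-points.

```python
import random

def kv_bits_per_token(key_bits, value_bits, head_dim=128):
    """Bits stored per token and head for one key and one value vector.

    Assumption: head_dim=128 is illustrative; scale/zero-point overhead is ignored.
    """
    return head_dim * (key_bits + value_bits)

def memory_saving(baseline, candidate):
    """Fraction of KV-cache bits saved by `candidate` relative to `baseline`."""
    return 1.0 - kv_bits_per_token(*candidate) / kv_bits_per_token(*baseline)

def quantize(xs, bits):
    """Uniform min-max quantization followed by dequantization."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]

def sq_error(xs, bits):
    """Sum of squared differences between xs and its quantized version."""
    return sum((x - q) ** 2 for x, q in zip(xs, quantize(xs, bits)))

rng = random.Random(0)
# Synthetic stand-ins for cache entries. Assumption: keys have a 3x larger
# scale than values, mimicking the norm disparity the abstract describes.
values = [rng.gauss(0.0, 1.0) for _ in range(4096)]
keys = [3.0 * rng.gauss(0.0, 1.0) for _ in range(4096)]

# K4V2 versus the uniform K4V4 baseline: (4 + 2) / (4 + 4) bits per pair.
saving = memory_saving((4, 4), (4, 2))

# Same 6-bit budget per key/value pair, split two ways.
err_k4v2 = sq_error(keys, 4) + sq_error(values, 2)
err_k2v4 = sq_error(keys, 2) + sq_error(values, 4)

print(f"K4V2 saves {saving:.0%} of KV-cache bits relative to K4V4")
print(f"squared error, K4V2: {err_k4v2:.1f}  K2V4: {err_k2v4:.1f}")
```

Under these assumptions, the two 6-bit splits use identical memory, yet giving the extra bits to the larger-scale tensor yields the smaller summed squared error, because the step size of a uniform quantizer grows with the range of the data it covers.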
