PatternKV: Flattening KV Representation Expands Quantization Headroom

Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

TL;DR

PatternKV addresses the KV cache bottleneck in autoregressive LLMs by shifting from outlier-focused quantization to distribution flattening through online pattern mining and residual quantization. By aligning each KV vector to its nearest pattern and quantizing only the residual, PatternKV flattens the quantization target and narrows its range, which improves fidelity at low bit-widths. The approach is grounded in a variance-decomposition view and rests on two observations: the K cache has a stable structure that evolves gradually with context, and the V cache carries latent semantic patterns. Across long-context and test-time scaling settings, PatternKV reports a 1.4× throughput gain and 1.25× larger batch support, with only a 0.08% average drop at 4 bits relative to FP16. These contributions offer practical, scalable improvements for deploying long-context LLMs with low-bit KV representations.
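
One standard way to read the variance-decomposition view is through the law of total variance, sketched below for a single channel. The symbols are assumptions introduced here for illustration and may not match the paper's notation: x is a KV entry, c is the index of its assigned pattern, \mu_c = E[x | c] is the pattern (cluster centroid), and r = x - \mu_c is the residual that gets quantized.

```latex
\begin{aligned}
\operatorname{Var}(x)
  &= \underbrace{\mathbb{E}\big[\operatorname{Var}(x \mid c)\big]}_{\text{within-pattern}}
   + \underbrace{\operatorname{Var}\big(\mathbb{E}[x \mid c]\big)}_{\text{between-pattern}} \\
\operatorname{Var}(r)
  &= \operatorname{Var}(x - \mu_c)
   = \mathbb{E}\big[\operatorname{Var}(x \mid c)\big]
   = \operatorname{Var}(x) - \operatorname{Var}(\mu_c)
   \le \operatorname{Var}(x)
\end{aligned}
```

Under these assumptions, subtracting the centroid removes the between-pattern term, so the residual can never have more variance than the original entry, and it has strictly less whenever the centroids differ.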

Abstract

KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.
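
The align-then-quantize step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: plain offline k-means stands in for the online pattern mining, the quantizer is a generic per-vector min-max uniform quantizer, the data is synthetic, and all function names (`quantize_dequantize`, `mine_patterns`, `pattern_residual_quantize`) are hypothetical.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Per-vector min-max uniform quantization, followed by dequantization."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels  # guard against zero range
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo

def mine_patterns(x, num_patterns, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); an assumed stand-in for online mining."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=num_patterns, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(num_patterns):
            members = x[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def pattern_residual_quantize(x, centers, bits):
    """Align each vector to its nearest pattern and quantize only the residual."""
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = dist.argmin(axis=1)
    residual = x - centers[assign]
    recon = centers[assign] + quantize_dequantize(residual, bits)
    return recon, residual

# Synthetic "KV" vectors: a few well-separated patterns plus small noise.
rng = np.random.default_rng(0)
true_centers = rng.normal(0.0, 5.0, size=(4, 16))
labels = rng.integers(0, 4, size=512)
kv = true_centers[labels] + rng.normal(0.0, 0.3, size=(512, 16))

centers = mine_patterns(kv, num_patterns=4)
direct = quantize_dequantize(kv, bits=2)
recon, residual = pattern_residual_quantize(kv, centers, bits=2)

direct_mse = float(((kv - direct) ** 2).mean())
pattern_mse = float(((kv - recon) ** 2).mean())
kv_range = float((kv.max(axis=-1) - kv.min(axis=-1)).mean())
res_range = float((residual.max(axis=-1) - residual.min(axis=-1)).mean())

print(f"mean per-vector range: raw={kv_range:.3f}, residual={res_range:.3f}")
print(f"2-bit reconstruction MSE: direct={direct_mse:.4f}, pattern-aligned={pattern_mse:.4f}")
```

On this toy data the residuals span a narrower range than the raw vectors, so the same 2-bit grid gives finer steps. A real system would also need to store the pattern index for each vector and keep the patterns themselves in higher precision; that overhead is ignored in this sketch.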

Paper Structure

This paper contains 39 sections, 25 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The left figure illustrates the original distribution of the KV vectors, while the right figure depicts the distribution of the residuals obtained after aligning the original vectors with the corresponding pattern vectors. Each pattern vector is the centroid of its cluster.
  • Figure 2: Channel-wise mean absolute value distributions. Left: embedding-only injection; Right: full-input injection. Outlier channels are already evident under embedding-only input, and the full input further enlarges the range and extremes. Additional figures are provided in the appendix.
  • Figure 3: (a) t-SNE visualization of the K-cache distributions across attention heads along a single inference trajectory. (b) Illustration of the degree of alignment between V-cache clusters and semantic categories. Additional figures are provided in the appendix.
  • Figure 4: Overview of the PatternKV pipeline: pattern vectors are mined online, KV vectors are aligned to their nearest pattern, and only residuals are quantized.
  • Figure 5: GSM8K accuracy under zero-shot CoT on Llama-3.1-8B-Instruct.
  • ...and 4 more figures