AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models

Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu

TL;DR

AnTKV tackles the memory bottleneck of KV caches in large language models by introducing anchor token-aware vector quantization. It combines offline token-aware centroid learning with online Anchor Score-based token selection to preserve a small subset of high-impact tokens in full precision, enabling sub-bit KV quantization with minimal accuracy loss. Empirical results across LLaMA-2/3 and Mistral show strong perplexity and reasoning performance gains, substantial long-context scalability, and up to 3.5x decoding throughput improvements against FP16 baselines. This work demonstrates that selectively preserving anchor tokens can unlock efficient ultra-low-bit KV cache quantization for practical long-context inference.

Abstract

Quantization has emerged as an effective and lightweight solution for reducing the memory footprint of the KV cache in Large Language Models. Nevertheless, minimizing the accuracy degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. While scalar quantization cannot go below 1 bit per element, vector quantization exploits intra-vector correlations and enables sub-bit regimes, making it better suited to ultra-low-bit quantization. To further mitigate quantization-induced degradation, we show that the degradation in attention quality is highly uneven across tokens. To investigate this unevenness, we introduce the Anchor Score (AnS) to measure each token's sensitivity to quantization. Our analysis and experiments show that preserving a small subset (1%) of tokens with the highest Anchor Score significantly mitigates accuracy loss under aggressive quantization. We propose AnTKV, a dual-stage framework that leverages anchor token-aware vector quantization to compress the KV cache. It combines offline token-aware centroid learning and online anchor token selection to balance compression and accuracy. To enable efficient deployment, we design an online anchor token selection kernel compatible with FlashAttention. It allows LLaMA3-8B to scale to 840K tokens on a single 80GB A100, while delivering up to $3.5\times$ higher decoding throughput than the FP16 baseline. Experiments demonstrate that AnTKV matches or surpasses prior methods at 4-bit and significantly reduces perplexity under ultra-low-bit quantization, achieving 6.32 at 1-bit on Mistral-7B, compared to 7.25 for CQ and 15.36 for KVQuant.
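To make the sub-bit and anchor-token ideas in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the tuple-based storage format, the `max(1, ...)` rounding of the anchor count, and the bit accounting (which ignores codebook storage) are all assumptions made for illustration. The per-token `scores` stand in for the Anchor Score, whose definition is not given in this summary, so they are passed in as plain numbers.

```python
import math

def bits_per_element(num_centroids, group_dim):
    """Bits stored per scalar when each group of `group_dim` scalars
    is replaced by one index into a codebook of `num_centroids` entries."""
    return math.log2(num_centroids) / group_dim

def effective_bits(quant_bits, keep_ratio, full_bits=16):
    """Average bits per scalar when a fraction `keep_ratio` of tokens
    stays in full precision (assumed accounting, codebook cost ignored)."""
    return (1.0 - keep_ratio) * quant_bits + keep_ratio * full_bits

def nearest_centroid(vec, codebook):
    """Index of the codebook entry closest to `vec` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(vec, codebook[j])),
    )

def compress_tokens(tokens, scores, codebook, keep_ratio=0.01):
    """Keep the highest-scoring tokens in full precision and replace
    every other token with a codebook index.

    `scores` is a hypothetical stand-in for the per-token Anchor Score.
    """
    n = len(tokens)
    n_keep = max(1, math.ceil(keep_ratio * n))  # assumption: keep at least one
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    anchors = set(order[:n_keep])
    out = []
    for i, tok in enumerate(tokens):
        if i in anchors:
            out.append(("fp", list(tok)))  # anchor token, stored as-is
        else:
            out.append(("vq", nearest_centroid(tok, codebook)))
    return out

# Toy usage: 4 two-dimensional tokens, a 2-entry codebook, keep 25%.
tokens = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
scores = [0.1, 0.2, 0.3, 9.0]
codebook = [[0.0, 0.0], [1.0, 1.0]]
compressed = compress_tokens(tokens, scores, codebook, keep_ratio=0.25)
```

Under these assumptions, a 256-entry codebook over groups of 16 scalars gives `bits_per_element(256, 16) == 0.5`, which is the sense in which vector quantization can drop below the 1-bit floor of scalar quantization. In the toy usage, the outlier token with the largest score is kept in full precision while the other three are reduced to codebook indices.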

Paper Structure

This paper contains 28 sections, 1 theorem, 14 equations, 13 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $\delta\bm{K}$ and $\delta\bm{V}$ be the error perturbation terms corresponding to $\bm{K}$ and $\bm{V}$, respectively, and satisfy [...]. Then we have [...] and [...], where $\bm{e}\in\mathbb{R}^{n}$ is a vector whose entries are all $1$.

Figures (13)

  • Figure 1: The $L_1$-norm error of the attention output when the $i$-th token's KV cache in Mistral-7B is quantized to 1-bit.
  • Figure 2: Perplexity of Mistral-7B on WikiText-2 across different quantization bit-widths.
  • Figure 3: Overview of AnTKV. In stage (a), token-aware centroids are learned from calibration data through weighted clustering, where the weights are error-propagation factors obtained by forward error analysis. In stage (b), the KV cache is quantized with the centroids, and the Anchor Score (AnS) is computed to identify anchor tokens, which are preserved in full precision to mitigate accuracy loss.
  • Figure 4: Understanding and reasoning accuracy on MMLU, ARC-C, PIQA, and MathQA under different quantization bit-widths.
  • Figure 5: Accuracy on LongBench under different KV cache quantization bit-widths.
  • ...and 8 more figures
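
The weighted clustering mentioned in the Figure 3 caption can be illustrated with a plain weighted k-means loop. This is a generic sketch, not the paper's procedure: the function name, the fixed iteration count, and the caller-supplied initial centroids are assumptions, and the `weights` are arbitrary positive numbers standing in for the error-propagation factors, whose formula is not given in this summary.

```python
def weighted_kmeans(points, weights, init_centroids, iters=10):
    """Lloyd-style k-means in which each point pulls its centroid
    in proportion to its weight.

    `weights` is a hypothetical stand-in for per-token
    error-propagation factors.
    """
    centroids = [list(c) for c in init_centroids]
    dim = len(centroids[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in centroids]
        totals = [0.0] * len(centroids)
        # Assignment step: nearest centroid by squared Euclidean distance.
        for p, w in zip(points, weights):
            j = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Update step: weighted mean of the assigned points.
        for j in range(len(centroids)):
            if totals[j] > 0:  # leave empty clusters where they are
                centroids[j] = [s / totals[j] for s in sums[j]]
    return centroids

# Toy usage: the point at 2.0 has weight 3, so it pulls its centroid toward it.
points = [[0.0], [2.0], [10.0], [12.0]]
weights = [1.0, 3.0, 1.0, 1.0]
learned = weighted_kmeans(points, weights, init_centroids=[[0.0], [10.0]])
```

In the toy usage the first centroid settles at 1.5 rather than the unweighted mean of 1.0, showing how heavier weights bias centroids toward the points whose quantization error matters most.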

Theorems & Definitions (1)

  • Theorem 3.1