IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact

Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, Chun Yuan

TL;DR

IntactKV addresses quantization-induced degradation in large language models by preserving the KV cache of pivot tokens that drive attention at the start of inputs. The method generates a lossless KV prefix from the full-precision model and optionally calibrates it as trainable parameters, enabling compatibility with weight-only, KV-cache, and activation quantization without adding inference cost. Theoretical analysis shows that preserving pivot-token KV caches tightens the quantization error bound, and empirical results across LLaMA, LLaMA-2, and Vicuna backbones demonstrate consistent improvements and new state-of-the-art performance on generation, MMLU, commonsense QA, and MT-Bench tasks. This approach offers a lightweight, plug-in enhancement for quantized LLMs with practical benefits for deployment efficiency and accuracy, including potential calibration of the KV prefix to further close the gap to full-precision models.

Abstract

Large language models (LLMs) excel in natural language processing but demand intensive computation. To mitigate this, various quantization methods have been explored, yet they compromise LLM performance. This paper unveils a previously overlooked type of outlier in LLMs. Such outliers are found to concentrate most of the attention scores on the initial tokens of the input, termed pivot tokens, which are crucial to the performance of quantized LLMs. Given that, we propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model. The approach is simple and easy to combine with existing quantization solutions with no extra inference overhead. In addition, IntactKV can be calibrated as additional LLM parameters to further boost quantized LLMs at minimal training cost. Mathematical analysis also proves that IntactKV effectively reduces the upper bound of quantization error. Empirical results show that IntactKV brings consistent improvement over various quantization methods across different LLMs and downstream tasks, leading to a new state-of-the-art for LLM quantization. The code is available at https://github.com/ruikangliu/IntactKV.
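The core idea in the abstract (pivot-token KV cache from the full-precision model, everything else from the quantized model) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention layer, a simple round-to-nearest weight quantizer, and hypothetical helper names (`fake_quantize`, `project_kv`, `build_kv_with_intact_prefix`).

```python
import numpy as np

def fake_quantize(w, bits=3):
    """Asymmetric round-to-nearest quantization per row, then dequantize.
    Assumption: a stand-in for any weight-only quantizer."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    return np.round((w - lo) / scale) * scale + lo

def project_kv(x, w_k, w_v):
    """Project hidden states x of shape (n, d) to keys and values."""
    return x @ w_k.T, x @ w_v.T

def build_kv_with_intact_prefix(x, w_fp, w_q, n_prefix):
    """Prefix rows use full-precision weights; the rest use quantized weights."""
    k_fp, v_fp = project_kv(x[:n_prefix], *w_fp)
    k_q, v_q = project_kv(x[n_prefix:], *w_q)
    return np.concatenate([k_fp, k_q]), np.concatenate([v_fp, v_q])

rng = np.random.default_rng(0)
d, n, n_prefix = 8, 6, 1                      # n_prefix=1 mimics keeping [BOS] intact
x = rng.normal(size=(n, d))                   # toy hidden states
w_fp = (rng.normal(size=(d, d)), rng.normal(size=(d, d)))
w_q = tuple(fake_quantize(w) for w in w_fp)

k_mixed, v_mixed = build_kv_with_intact_prefix(x, w_fp, w_q, n_prefix)
k_ref, v_ref = project_kv(x, *w_fp)           # full-precision reference
```

Because the prefix KV cache is computed once and then reused like any other cached entry, the sketch adds no per-step work at inference time, which matches the "no extra inference overhead" claim. In a real multi-layer model the hidden states of later tokens would also differ between the two models; the toy ignores that.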

Paper Structure (52 sections, 1 theorem, 9 equations, 21 figures, 17 tables)

Key Result

Theorem 1

Given the query vector ${\bm q}\in\mathbb{R}^d$ and the change of KV caches $\Delta {\bm K}, \Delta{\bm V} \in \mathbb{R}^{n\times d}$, the change of the attention head $\Delta{\bm h}$ is bounded by an expression in $\Delta{\bm K}$ and $\Delta{\bm V}$ with constants $C_1 = \frac{n^{3/2}}{\sqrt{d}} C_3 {\Vert {\bm q} \Vert}_2$, $C_2 = C_1 {\Vert {\bm V} \Vert}_2$, and $C_3={\Vert {\bm W}^O \Vert}_2$.
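To make the constants in Theorem 1 concrete, the following sketch evaluates $C_1$, $C_2$, and $C_3$ on random data and measures $\Vert\Delta{\bm h}\Vert_2$ directly, with and without zeroing the perturbation on the first token. It rests on several assumptions not stated here: ${\Vert\cdot\Vert}_2$ on matrices is taken as the spectral norm, the head is computed as ${\bm W}^O$ applied to the softmax-weighted values, and all function names are hypothetical. It does not reproduce the bound itself.

```python
import numpy as np

def attention_head(q, k, v, w_o):
    """Single-query attention head followed by an output projection (assumed form)."""
    scores = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return w_o @ (weights @ v)

def theorem_constants(q, v, w_o):
    """C_1, C_2, C_3 as defined in Theorem 1 (matrix 2-norm = spectral norm, assumed)."""
    n, d = v.shape
    c3 = np.linalg.norm(w_o, 2)
    c1 = n**1.5 / np.sqrt(d) * c3 * np.linalg.norm(q)
    c2 = c1 * np.linalg.norm(v, 2)
    return c1, c2, c3

def keep_prefix_intact(delta, n_prefix):
    """Zero the KV perturbation on the first n_prefix tokens."""
    out = delta.copy()
    out[:n_prefix] = 0.0
    return out

rng = np.random.default_rng(0)
n, d = 6, 4
q = rng.normal(size=d)
k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d))
w_o = rng.normal(size=(d, d))
dk, dv = 0.01 * rng.normal(size=(n, d)), 0.01 * rng.normal(size=(n, d))

c1, c2, c3 = theorem_constants(q, v, w_o)
h = attention_head(q, k, v, w_o)
dh_all = np.linalg.norm(attention_head(q, k + dk, v + dv, w_o) - h)
dh_intact = np.linalg.norm(
    attention_head(q, k + keep_prefix_intact(dk, 1), v + keep_prefix_intact(dv, 1), w_o) - h
)
print(f"C1={c1:.3f} C2={c2:.3f} C3={c3:.3f}")
print(f"||dh|| all perturbed: {dh_all:.5f}, first token intact: {dh_intact:.5f}")
```

Zeroing rows of $\Delta{\bm K}$ and $\Delta{\bm V}$ can only shrink their norms, which is the mechanism by which keeping pivot-token KV entries intact lowers a bound of this kind.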

Figures (21)

  • Figure 1: Visualizations of Transformer output and attention scores of LLaMA-30B and LLaMA-2-7B. Observations: (1) There are token-specific outliers that can be orders of magnitude larger than the rest of the tokens (enlarged in the box). These outliers occur at the [BOS] token, at the 28th token "'" in LLaMA-30B, and at the 13th token "." in LLaMA-2-7B; such tokens are referred to as pivot tokens. (2) The outliers over pivot tokens concentrate the attention scores on the pivot tokens themselves, and these scores are likely to be affected by quantization. More details can be found in the paper's appendix.
  • Figure 2: The mean squared error (MSE) of the last Transformer layer and attention layers w.r.t. the varying sizes of IntactKV. Observations: (1) The MSE continues to drop as the size of IntactKV increases. (2) Including the pivot tokens' KV cache in IntactKV leads to the most significant decrease in the quantization loss, demonstrating the importance of the pivot tokens' KV cache. More experiment details can be found in the paper's appendix.
  • Figure 3: Overview of the proposed IntactKV applied to a supervised fine-tuned LLM. The full-precision model takes the system prompt as input and generates IntactKV losslessly; this prefix is concatenated with the rest of the KV cache, which is produced by the quantized LLM. IntactKV can be further calibrated by minimizing the mean squared error $\mathcal{L}$ between the full-precision and quantized LLMs.
  • Figure 4: Results of weight and KV cache quantization with different bit-widths on the 5-shot MMLU benchmark. KV cache quantization is applied in addition to INT3/4 weight-only quantization. Blue and red lines indicate quantizing model weights to INT3 and INT4, respectively. We apply asymmetric per-head dynamic quantization to the KV cache.
  • Figure 5: System Prompt of Vicuna Models.
  • ...and 16 more figures
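The Figure 4 caption mentions asymmetric per-head dynamic quantization of the KV cache. A minimal sketch of what such a quantizer could look like follows. The granularity is an assumption (one min/max pair per head and per token, computed at run time), as are the function name `quantize_kv_per_head` and the `n_intact` option, which leaves the first tokens in full precision in the spirit of IntactKV.

```python
import numpy as np

def quantize_kv_per_head(cache, bits, n_intact=0):
    """Fake-quantize a KV cache of shape (n_heads, n_tokens, head_dim).

    Asymmetric: uses a min/max range rather than a symmetric range around zero.
    Dynamic: the range is computed from the tensor itself at call time.
    The first n_intact tokens are copied through unchanged (assumed IntactKV usage).
    """
    out = cache.copy()
    body = cache[:, n_intact:, :]
    lo = body.min(axis=-1, keepdims=True)
    hi = body.max(axis=-1, keepdims=True)
    levels = 2**bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)   # avoid a zero scale
    codes = np.clip(np.round((body - lo) / scale), 0, levels)
    out[:, n_intact:, :] = codes * scale + lo            # dequantize
    return out

rng = np.random.default_rng(0)
n_heads, n_tokens, head_dim = 2, 5, 8
bits, n_intact = 4, 1
keys = rng.normal(size=(n_heads, n_tokens, head_dim))
keys_q = quantize_kv_per_head(keys, bits=bits, n_intact=n_intact)
max_err = np.abs(keys_q - keys).max()
print(f"max abs error at INT{bits}: {max_err:.4f}")
```

With round-to-nearest, the error of each quantized element is at most half a quantization step, so lowering `bits` widens the step and the error, while the intact prefix contributes no error at all.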

Theorems & Definitions (2)

  • Theorem 1
  • proof