PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

TL;DR

PM-KVQ tackles the memory growth of KV caches in long-CoT LLMs by combining progressive mixed-precision quantization, block-wise memory allocation, and calibration with positional interpolation. The method reduces cumulative quantization error and captures long-context distributions without extra calibration overhead; it solves an Integer Programming problem to allocate per-block bit-widths and uses an Equivalent Right Shift strategy. Its pipeline profiles block sensitivity, applies channel-wise reparameterization with positional interpolation, and performs progressive quantization during inference. Across 7B–70B models, PM-KVQ improves reasoning benchmark performance by up to 8% under the same memory budget and remains robust even with a 2-bit KV Cache, with ablations validating each contribution. The approach offers practical, memory-efficient long-CoT inference, and the code is publicly available.
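
As a rough illustration of per-block bit-width allocation, the sketch below picks one bit-width per transformer block so that the summed error is minimal while the summed bit-widths stay within a budget. The error table, the candidate bit-widths, the budget, and the function name `allocate_bits` are all hypothetical, and the exhaustive search is a stand-in for an integer-programming solver; it is not the formulation used by PM-KVQ.

```python
from itertools import product


def allocate_bits(error, choices, budget):
    """Pick one bit-width per block by exhaustive search.

    error[i][b] is a hypothetical, pre-profiled error of block i at bit-width b.
    The total of the chosen bit-widths must not exceed `budget`.
    """
    best_assignment, best_cost = None, float("inf")
    for assignment in product(choices, repeat=len(error)):
        if sum(assignment) > budget:
            continue  # violates the memory budget
        cost = sum(error[i][b] for i, b in enumerate(assignment))
        if cost < best_cost:
            best_assignment, best_cost = assignment, cost
    return best_assignment, best_cost


# Hypothetical sensitivity profile for four blocks at 2 and 4 bits.
profile = [
    {2: 0.90, 4: 0.10},
    {2: 0.30, 4: 0.05},
    {2: 0.50, 4: 0.10},
    {2: 0.20, 4: 0.05},
]
bits, cost = allocate_bits(profile, choices=(2, 4), budget=12)
print(bits, round(cost, 3))
```

With this made-up profile, the two blocks that lose the most accuracy at 2 bits (the first and third) receive 4 bits. Exhaustive search grows exponentially with the number of blocks, so a real model would need a proper solver.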

Abstract

Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead because of the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs, which addresses these issues in two ways: (1) To reduce cumulative error, we design a progressive quantization strategy that gradually lowers the bit-width of the KV Cache in each block, and we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy that applies positional interpolation to short calibration data to approximate the distribution of long-context data. Extensive experiments on 7B–70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at https://github.com/thu-nics/PM-KVQ.
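
To make the short-context calibration issue concrete, the sketch below uses the standard RoPE frequency schedule (inverse frequency base^(-2i/d) for channel pair i) to compare how far the slowest-rotating channel pair turns over a short sequence, with and without stretched position indices. The sequence length of 2048, head dimension of 128, base of 10000, and scale factor of 16 are illustrative assumptions. Stretching the positions of short data is one reading of "calibration with positional interpolation", not a confirmed detail of the method.

```python
import math


def rope_angle(position, pair_index, head_dim, base=10000.0):
    """Rotation angle (radians) RoPE applies to one channel pair at a position."""
    inv_freq = base ** (-2.0 * pair_index / head_dim)
    return position * inv_freq


def max_angle(num_tokens, pair_index, head_dim, position_scale=1.0):
    """Largest angle reached over a sequence whose positions are scaled."""
    last_position = (num_tokens - 1) * position_scale
    return rope_angle(last_position, pair_index, head_dim)


head_dim = 128
slowest_pair = head_dim // 2 - 1  # lowest-frequency channel pair

# Short calibration sequence with its native positions.
short = max_angle(2048, slowest_pair, head_dim)
# Same number of tokens, positions stretched by a hypothetical factor of 16.
stretched = max_angle(2048, slowest_pair, head_dim, position_scale=16.0)

print(f"native: {short:.3f} rad, stretched: {stretched:.3f} rad")
print(f"fraction of a full turn: {short / (2 * math.pi):.3f} vs {stretched / (2 * math.pi):.3f}")
```

Under these assumptions, the slowest channel pair covers only a small fraction of a full rotation on the short sequence. That is one way to see why short calibration data may under-represent the value range such channels reach at long context.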

Paper Structure

This paper contains 25 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Method Overview. (a) The Progressive quantization technique: we progressively shrink the bit-width of KV Cache to fully utilize the memory budget. (b) The block-wise memory allocation technique: we allocate a higher bit-width to those transformer blocks with higher sensitivity. (c) Calibration with Positional Interpolation to approximate the distribution of long-context data with short-context data.
  • Figure 2: Different bit-width shrinking strategies when the bit-width is reduced from 4-bit to 2-bit.
  • Figure 3: Sensitivity to quantization of KV Cache in different transformer blocks. Different colors represent different memory budgets.
  • Figure 4: Sensitivity to quantization of KV Cache in different transformer blocks. Different colors represent different memory budgets.
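
For the 4-bit to 2-bit shrinking case named in the Figure 2 caption, the sketch below shows one generic way to shrink already-quantized codes: a rounding right shift by two bits, with the scale and zero point adjusted so that dequantization still uses the same affine form. It assumes unsigned 4-bit codes under the convention x ≈ scale · (q − zero_point). The function names are hypothetical, and this is not claimed to be the paper's Equivalent Right Shift strategy.

```python
def shrink_4bit_to_2bit(codes, scale, zero_point):
    """Re-quantize unsigned 4-bit codes (0..15) to 2-bit codes (0..3).

    Adding 2 before shifting rounds to nearest instead of truncating;
    the clamp keeps the largest codes inside the 2-bit range.
    """
    new_codes = [min((q + 2) >> 2, 3) for q in codes]
    # One 2-bit step spans four 4-bit steps, so the scale grows by 4
    # and the zero point shrinks by 4.
    return new_codes, scale * 4.0, zero_point / 4.0


def dequantize(codes, scale, zero_point):
    """Affine dequantization: x ~= scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in codes]


codes_4bit = [0, 1, 2, 5, 6, 15]
codes_2bit, scale_2bit, zero_2bit = shrink_4bit_to_2bit(codes_4bit, scale=0.5, zero_point=8)

print(codes_2bit, scale_2bit, zero_2bit)
print(dequantize(codes_4bit, 0.5, 8))
print(dequantize(codes_2bit, scale_2bit, zero_2bit))
```

The shift operates on the stored integers only, so the original floating-point values are not needed to lower the bit-width. The cost is the extra rounding error visible when the two dequantized lists are compared.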