XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao

TL;DR

XQuant tackles the memory burden of KV caches in large language models with a training-free, plug-and-play approach to ultra-low-bit quantization. It combines data-free calibration, which adaptively tunes quantization endpoints without any calibration data, with cross-layer KV cache compression, which exploits inter-layer similarity to share quantized caches across adjacent layers. Results on TruthfulQA and LongBench show XQuant reaching equivalent bit-widths below 1.4 while remaining competitive with full-precision baselines and outperforming prior methods such as KIVI-2bit and AsymKV in many settings. The framework offers a practical, efficient way to deploy LLMs in resource-constrained environments, with robust performance across multiple models and tasks; the paper also analyzes hyperparameter stability and speedups.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.
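
To give a rough sense of the memory figures at stake, the sketch below estimates KV cache size at different bit-widths. The model configuration is hypothetical (it is not taken from the paper). The estimate treats the equivalent bit-width as the average number of bits stored per cached value, which is an assumption, and it ignores the overhead of scaling factors and zero-points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Estimate KV cache size in bytes, ignoring quantization metadata."""
    # The factor 2 accounts for storing both keys and values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value / 8


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)

for bits in (16, 2, 1.4):
    gib = kv_cache_bytes(bits_per_value=bits, **config) / 2**30
    print(f"{bits:>4} bits/value: {gib:.3f} GiB")
```

Under these assumptions the cache shrinks from 2 GiB at 16 bits to 0.25 GiB at 2 bits and 0.175 GiB at 1.4 bits per value.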

Paper Structure

This paper contains 34 sections, 26 equations, 4 figures, 11 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustration of the XQuant workflow. XQuant partitions the KV cache into layer-wise pairs. For the higher layer in each pair, XQuant computes and stores only the scaling factors and zero-points during the quantization phase, and then fetches the quantized cache from the lower layer during the dequantization phase.
  • Figure 2: The illustration of the proposed data-free calibration method.
  • Figure 3: Layer-wise analysis of absolute differences between adjacent layers in quantized KV Cache matrices. Here, delta represents the absolute difference of quantized values between consecutive layers.
  • Figure 4: Comparison of Execution Time.
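
The pairing described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain per-tensor min-max uniform quantization on flat lists, and all names (`QuantizedLayer`, `compress_pair`, and so on) are hypothetical. The paper's actual grouping and calibration are not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantizedLayer:
    scale: float
    zero_point: float
    codes: Optional[List[int]]  # None for the higher layer of a pair


def quant_params(values, bits):
    """Min-max scale and zero-point (an assumed, simplified scheme)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return scale, lo


def quantize(values, bits):
    scale, zero = quant_params(values, bits)
    levels = (1 << bits) - 1
    codes = [min(levels, max(0, round((v - zero) / scale))) for v in values]
    return QuantizedLayer(scale, zero, codes)


def compress_pair(lower, higher, bits=2):
    """Store codes for the lower layer only; the higher layer keeps metadata."""
    q_lower = quantize(lower, bits)
    scale, zero = quant_params(higher, bits)
    return q_lower, QuantizedLayer(scale, zero, None)


def dequantize(layer, shared_codes=None):
    """Reconstruct values, borrowing codes from the paired lower layer if needed."""
    codes = layer.codes if layer.codes is not None else shared_codes
    return [c * layer.scale + layer.zero_point for c in codes]


q_lower, q_higher = compress_pair([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0])
print(dequantize(q_lower))
print(dequantize(q_higher, shared_codes=q_lower.codes))
```

In this toy case the two layers have the same shape up to scale, so the shared codes reconstruct the higher layer exactly. With real caches the reconstruction is only approximate, and its quality depends on how similar the quantized values of adjacent layers are, which is the quantity Figure 3 analyzes.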