Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression

Peiyu Liu; Ze-Feng Gao; Wayne Xin Zhao; Yipeng Ma; Tao Wang; Ji-Rong Wen

TL;DR

This work tackles the memory bottleneck of KV cache in large language model inference by introducing DecoQuant, a data-free, low-bit quantization method based on Matrix Product Operator tensor decomposition. By splitting activation matrices into a large central tensor with a narrowed value range and a small auxiliary tensor, DecoQuant enables $B$-bit quantization of the large part while preserving FP16 precision for the small part, dramatically reducing KV cache memory with minimal impact on generation quality. The approach supports multiple quantization settings (WxA16, W16Ax, WxAx) and includes a fused dequantization kernel to boost efficiency, achieving up to ~75% memory reduction and ~1.25x speedup for long sequences. Extensive experiments on LLaMA and OPT models demonstrate competitive performance against baselines like RTN and SmoothQuant, underscoring DecoQuant’s practical impact for data-constrained deployment of LLMs.
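To make the memory claim concrete, here is a rough back-of-the-envelope sketch in Python. The helper names `kv_cache_bytes` and `compressed_fraction` are hypothetical, and the 4-bit setting, sequence length, and batch size are illustrative assumptions rather than settings reported above. The LLaMA-7B shape (32 layers, 32 heads, head dimension 128) is the publicly documented configuration. Quantization metadata such as scales and zero points is ignored.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bits):
    """Bytes needed to cache keys and values at the given bit width."""
    # The factor 2 accounts for storing both keys and values.
    elements = 2 * n_layers * n_heads * head_dim * seq_len * batch
    return elements * bits // 8


def compressed_fraction(bits, small_fraction):
    """Size relative to FP16 when the large tensor uses `bits` per element.

    `small_fraction` is the (assumed) number of elements in the small
    FP16 tensor divided by the number of elements in the original matrix.
    """
    return bits / 16 + small_fraction


# LLaMA-7B shape; seq_len=2048 and batch=1 are illustrative choices.
fp16 = kv_cache_bytes(32, 32, 128, seq_len=2048, batch=1, bits=16)
frac = compressed_fraction(bits=4, small_fraction=0.0)
print(f"FP16 KV cache: {fp16 / 2**30:.2f} GiB")
print(f"Relative size at 4 bits (small tensor ignored): {frac:.2f}")
```

Under these assumptions, a 4-bit large tensor with a negligible small tensor occupies a quarter of the FP16 size, which is consistent with the "up to ~75%" figure; any nonzero `small_fraction` reduces the saving accordingly.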

Abstract

Key-value (KV) caching is an important technique for accelerating the inference of large language models (LLMs), but it incurs significant memory overhead. To compress the KV cache, existing methods often compromise precision or require extra calibration data, which limits their practicality in LLM deployment. In this paper, we introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition, to effectively compress the KV cache. Our core idea is to adjust the outlier distribution of the original matrix by performing tensor decomposition, so that the quantization difficulty migrates from the matrix to the decomposed local tensors. Specifically, we find that outliers mainly concentrate in the small local tensors, while the large tensors tend to have a narrower value range. Based on this finding, we propose to apply low-bit quantization to the large tensor while maintaining a high-precision representation for the small tensor. Furthermore, we use the proposed quantization method to compress the KV cache of LLMs to accelerate inference, and we develop an efficient dequantization kernel tailored specifically to DecoQuant. In extensive experiments, DecoQuant demonstrates remarkable efficiency gains, with up to a ~75% reduction in memory footprint while maintaining comparable generation quality.
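The idea of quantizing the large tensor and keeping the small one in high precision can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses a simplified two-tensor factorization (one reshape plus one SVD) in place of a full Matrix Product Operator chain, plain per-row round-to-nearest quantization, and synthetic data with artificially scaled "outlier" columns. The function names, the factor sizes `i1` and `j1`, and the choice to fold the singular values into the small tensor are all assumptions made for illustration.

```python
import numpy as np


def decompose(mat, i1, j1):
    """Factor mat (I x J) into a small (i1*j1 x r) and a large (r x i2*j2) tensor.

    Assumes I is divisible by i1 and J by j1.
    """
    rows, cols = mat.shape
    i2, j2 = rows // i1, cols // j1
    # Regroup indices so (i1, j1) index the rows and (i2, j2) the columns.
    t = mat.reshape(i1, i2, j1, j2).transpose(0, 2, 1, 3).reshape(i1 * j1, i2 * j2)
    u, s, vt = np.linalg.svd(t, full_matrices=False)
    # Assumption: singular values go into the small tensor, so the large
    # tensor has orthonormal rows and all of its entries lie in [-1, 1].
    return u * s, vt


def recompose(small, large, i1, j1, i2, j2):
    """Invert the index regrouping done in `decompose`."""
    t = small @ large
    return t.reshape(i1, j1, i2, j2).transpose(0, 2, 1, 3).reshape(i1 * i2, j1 * j2)


def quantize_rtn(x, bits):
    """Per-row asymmetric round-to-nearest quantization (bits <= 8)."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    return codes.astype(np.float64) * scale + lo


rng = np.random.default_rng(0)
mat = rng.normal(size=(64, 64))
mat[:, :2] *= 20.0  # synthetic outlier channels (illustrative only)

small, large = decompose(mat, i1=2, j1=2)
codes, scale, lo = quantize_rtn(large, bits=4)       # large tensor: 4-bit codes
small_fp16 = small.astype(np.float16)                # small tensor: FP16

approx = recompose(small_fp16.astype(np.float64),
                   dequantize(codes, scale, lo), 2, 2, 32, 32)
rel_err = np.linalg.norm(approx - mat) / np.linalg.norm(mat)
print("small:", small.shape, "large:", large.shape)
print("relative reconstruction error:", rel_err)
```

In this toy setup the large tensor has as many elements as the original matrix, so the memory saving comes entirely from its lower bit width, while the small FP16 tensor adds only a few extra values. Whether the reconstruction error beats direct quantization of the matrix depends on the data and the factorization, so the sketch prints the error rather than asserting an outcome.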

Paper Structure (19 sections, 4 equations, 6 figures, 5 tables)

Introduction
Preliminary
Methods
DecoQuant: Matrix Quantization based on Decomposition
Efficient Inference based on DecoQuant
Discussion
Experiments
Experimental Setup
Main Results
Detailed Analysis
Analysis of the Efficiency
Memory and Latency.
Related Work
Conclusion
Limitations
...and 4 more sections

Figures (6)

Figure 1: Outlier distributions of local tensors and matrices. "Keys" are extracted from the output features of value projections in the 16th layer of LLaMA-7B. For investigations of other structures, see the Appendix.
Figure 2: Matrix quantization based on DecoQuant. The alternating black/white and blue/white squares in the figure denote quantized matrices.
Figure 3: Operator fusion for dequantization.
Figure 4: Quantization error analysis with respect to quantization strategy and decomposition length.
Figure 5: Comparison of MPO with other decomposition methods.
...and 1 more figures
