LCQ: Low-Rank Codebook based Quantization for Large Language Models

Wen-Pu Cai, Ming-Yang Li, Wu-Jun Li

TL;DR

This work tackles the challenge of deploying large language models under storage and compute constraints by enhancing weight quantization. It introduces LCQ, a low-rank codebook quantization method where the codebook is $\mathbf{C} = \mathbf{S}^T \mathbf{V} - \mathbf{B}$, allowing ranks greater than one for richer representation, and optimizes $\mathbf{S}, \mathbf{V}, \mathbf{B}$ via gradient-based learning. The framework uses a Transformer-wide output-reconstruction objective, gradient approximations for quantization, and reparameterization to stabilize training, complemented by a double-quantization strategy to reduce storage. Empirical results across OPT, LLaMA, and LLaVA show LCQ outperforms rank-one baselines (AWQ, OmniQuant), especially at 2-bit quantization, with negligible additional storage, supporting practical deployment on resource-constrained devices. The approach advances LLM quantization by balancing accuracy, storage, and compatibility with existing PTQ workflows, enabling more efficient and accessible large-scale models.

Abstract

Large language models (LLMs) have recently demonstrated promising performance in many tasks. However, the high storage and computational cost of LLMs has become a challenge for deploying them. Weight quantization has been widely used for model compression, as it can reduce both storage and computational cost. Most existing weight quantization methods for LLMs use a rank-one codebook for quantization, which results in substantial accuracy loss when the compression ratio is high. In this paper, we propose a novel weight quantization method for LLMs, called low-rank codebook based quantization (LCQ). LCQ adopts a low-rank codebook, the rank of which can be larger than one, for quantization. Experiments show that LCQ can achieve better accuracy than existing methods with negligible extra storage cost.
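To make the codebook form $\mathbf{C} = \mathbf{S}^T \mathbf{V} - \mathbf{B}$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. The shapes are assumptions made for illustration: $\mathbf{S}$ is `(r, G)` for `G` weight groups, $\mathbf{V}$ is `(r, K)` for `K = 2**bits` codewords, and $\mathbf{B}$ is a per-group offset of shape `(G, 1)`. The helper names `build_codebook` and `quantize` are hypothetical, and the values of `S`, `V`, `B` are set by hand here, whereas LCQ learns them by gradient-based optimization. With `r = 1` and `V = [0, 1, ..., K-1]` the codebook reduces to a uniform grid per group; adding a second row lets the codewords become non-uniformly spaced.

```python
import numpy as np

def build_codebook(S, V, B):
    """Return C = S^T V - B with shape (G, K).

    Assumed shapes (illustrative only): S is (r, G), V is (r, K),
    B is (G, 1) or (G, K).
    """
    return S.T @ V - B

def quantize(W, C):
    """Map each weight to its nearest codeword within its group.

    W is (G, n): n weights per group. C is (G, K): K codewords per group.
    Returns the codeword indices (G, n) and the dequantized weights (G, n).
    """
    dist = np.abs(W[:, :, None] - C[:, None, :])   # (G, n, K)
    idx = dist.argmin(axis=2)                      # (G, n)
    W_hat = np.take_along_axis(C, idx, axis=1)     # (G, n)
    return idx, W_hat

rng = np.random.default_rng(0)
G, n, bits = 4, 128, 2
K = 2 ** bits
W = rng.normal(size=(G, n))

# Rank-one codebook: a uniform min-max grid per group.
w_min = W.min(axis=1, keepdims=True)
w_max = W.max(axis=1, keepdims=True)
scale = (w_max - w_min) / (K - 1)                  # (G, 1)
S1 = scale.T                                       # (1, G)
V1 = np.arange(K, dtype=float)[None, :]            # (1, K)
B = -w_min                                         # so C1[g, k] = w_min[g] + scale[g] * k
C1 = build_codebook(S1, V1, B)

# Rank-two codebook: a hand-picked second row bends the grid (values are arbitrary).
S2 = np.vstack([S1, 0.1 * np.ones((1, G))])        # (2, G)
V2 = np.vstack([V1, [[0.0, 0.5, -0.5, 0.0]]])      # (2, K)
C2 = build_codebook(S2, V2, B)

idx1, W_hat1 = quantize(W, C1)
idx2, W_hat2 = quantize(W, C2)
print("rank-1 MSE:", np.mean((W - W_hat1) ** 2))
print("rank-2 MSE:", np.mean((W - W_hat2) ** 2))
```

The hand-picked second row is not expected to reduce the error by itself; the sketch only shows that a rank larger than one enlarges the set of codebooks that can be represented, which is what the learned parameters exploit.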

Paper Structure (15 sections, 11 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Illustration of low-rank codebook based quantization.
  • Figure 2: Visualization of $\mathcal{S}$ with AWQ, shown on a log scale, for 3-bit quantization of Transformer blocks in the OPT-1.3B model. Left image: block 1; middle image: block 10; right image: block 20.
  • Figure 3: Training and inference process of LCQ with reparameterization.
  • Figure 4: Hyperparameter sensitivity with "W2 G128" for OPT-1.3B.