Low-Rank Correction for Quantized LLMs

Meyer Scetbon, James Hensman

TL;DR

This paper tackles post-training quantization of large language models by jointly quantizing weights and activations at $W4A4$ while correcting activation-quantization errors with full-precision low-rank terms. It introduces Low-Rank Correction (LRC), a framework that alternates between updating the quantized weights and learning a low-rank correction, with an initialization and numerical-stability scheme designed for LLMs. With ranks equivalent to $10\%$ of the original weight matrix size, the method reduces the accuracy gap with the original model by more than $50\%$; with ranks equivalent to $30\%$, it closes the gap completely. Results are reported on Llama-2, Llama-3, Phi-3 and Mixtral models, and the method remains compatible with QuaRot and GPTQ. This enables 4-bit quantization of both weights and activations, with practical memory and compute benefits when deploying large models on constrained hardware. The work highlights activation quantization as a primary source of error under aggressive quantization and offers a scalable route to mitigating it that relies only on calibration data.

Abstract

We consider the problem of model compression for Large Language Models (LLMs) at post-training time, where the task is to compress a well-trained model using only a small set of calibration input data. In this work, we introduce a new low-rank approach to correct for quantization errors of activations in LLMs: we propose to add low-rank weight matrices in full precision that act on the unquantized activations. We then solve a joint optimization problem over the quantized representation of the weights and additional low-rank weight matrices to quantize both weights and activations. We focus on the case of 4-bit weight-and-activation quantization (W4A4). Using ranks equivalent to 10% of the original weight matrix size, our approach reduces the accuracy gap with the original model by more than 50%. Using ranks equivalent to 30% of the original weight matrix, the accuracy gap is closed completely. We demonstrate our results on four recent LLMs, namely Llama-2, Llama-3, Phi-3 and Mixtral models.
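
The computation described above, a quantized weight applied to quantized activations plus a full-precision low-rank term applied to the unquantized activations, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `quantize_rtn`, `matvec` and `lrc_forward` are hypothetical, a simple symmetric round-to-nearest quantizer stands in for the actual quantization scheme, and the rank-1 correction is fitted to a single input purely for demonstration rather than learned by the paper's joint optimization.

```python
# Illustrative sketch only: all names are hypothetical, and round-to-nearest
# (RTN) stands in for whatever quantizer is actually used.

def quantize_rtn(values, bits=4):
    """Symmetric round-to-nearest quantization; returns dequantized floats."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lrc_forward(w_hat, u, v, x, act_bits=4):
    """Compute W_hat @ Q_a(x) + U @ (V^T @ x).

    w_hat: quantized weights (d_out x d_in), u: d_out x k, v: d_in x k.
    The low-rank path sees the *unquantized* activations x.
    """
    main_path = matvec(w_hat, quantize_rtn(x, act_bits))
    v_transposed = [list(col) for col in zip(*v)]   # k x d_in
    low_rank_path = matvec(u, matvec(v_transposed, x))
    return [m + c for m, c in zip(main_path, low_rank_path)]


# Toy layer: 2 outputs, 3 inputs.
W = [[0.5, -1.2, 0.3], [2.0, 0.1, -0.7]]
x = [1.0, -0.4, 0.25]

w_hat = [quantize_rtn(row) for row in W]            # per-row 4-bit weights
y_ref = matvec(W, x)                                # full-precision output

# W4A4 output without any correction (U and V set to zero, rank 1).
zero_u = [[0.0], [0.0]]
zero_v = [[0.0], [0.0], [0.0]]
y_quant = lrc_forward(w_hat, zero_u, zero_v, x)

# Toy rank-1 correction fitted to this single input (NOT the paper's
# optimization): U holds the error, V is x / ||x||^2, so U V^T x = error.
error = [r - q for r, q in zip(y_ref, y_quant)]
x_norm_sq = sum(xi * xi for xi in x)
U = [[e] for e in error]
V = [[xi / x_norm_sq] for xi in x]
y_lrc = lrc_forward(w_hat, U, V, x)
```

In this toy setting `y_lrc` matches `y_ref` up to floating-point error because the correction was fitted to the one input it is evaluated on; with real calibration data the correction has to generalize across many inputs, which is where the choice of rank matters.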

Paper Structure

This paper contains 44 sections, 3 theorems, 35 equations, 6 figures, 10 tables, 5 algorithms.

Key Result

Proposition 3.1

Let us denote $\bm{Y}:=Q_a(\bm{X})\in\mathcal{C}(a)\cap\mathbb{R}^{d^{\text{in}}\times n}$, and assume $\bm{Y}$ is full rank. Then, by denoting $\widetilde{\bm{W}}^{(t)}:=(\bm{W} - \bm{U}^{(t)}(\bm{V}^{(t)})^\top)\bm{X}\bm{Y}^\top (\bm{Y}\bm{Y}^\top)^{-1}$, we have that the optimization problem defi
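
As a sketch of why $\widetilde{\bm{W}}^{(t)}$ has this form, it can be read as an ordinary least-squares fit of the corrected target $(\bm{W} - \bm{U}^{(t)}(\bm{V}^{(t)})^\top)\bm{X}$ onto the quantized activations $\bm{Y}$. The symbols $\bm{A}$, $\bm{R}$ and $\widehat{\bm{W}}$ below are notation introduced here for illustration only, with $\widehat{\bm{W}}$ standing for any candidate weight matrix of the same shape:

$$
\bm{A} := (\bm{W} - \bm{U}^{(t)}(\bm{V}^{(t)})^\top)\bm{X}, \qquad \widetilde{\bm{W}}^{(t)} = \bm{A}\bm{Y}^\top(\bm{Y}\bm{Y}^\top)^{-1}, \qquad \bm{R} := \bm{A} - \widetilde{\bm{W}}^{(t)}\bm{Y}
$$

$$
\bm{R}\bm{Y}^\top = \bm{A}\bm{Y}^\top - \bm{A}\bm{Y}^\top(\bm{Y}\bm{Y}^\top)^{-1}\bm{Y}\bm{Y}^\top = \bm{0}
$$

$$
\|\bm{A} - \widehat{\bm{W}}\bm{Y}\|_F^2 = \|\bm{R} + (\widetilde{\bm{W}}^{(t)} - \widehat{\bm{W}})\bm{Y}\|_F^2 = \|\bm{R}\|_F^2 + \|(\widetilde{\bm{W}}^{(t)} - \widehat{\bm{W}})\bm{Y}\|_F^2
$$

The cross term vanishes because $\bm{R}\bm{Y}^\top = \bm{0}$, and $\bm{R}$ does not depend on $\widehat{\bm{W}}$. Under this reading, fitting $\widehat{\bm{W}}\bm{Y}$ to $\bm{A}$ is the same as fitting it to $\widetilde{\bm{W}}^{(t)}\bm{Y}$, which has the shape of a standard layer-wise objective on the inputs $\bm{Y}$. The full-rank assumption on $\bm{Y}$ is what makes $\bm{Y}\bm{Y}^\top$ invertible.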

Figures (6)

  • Figure 1: Computational scheme of our method, where both weights and activations are quantized, and a low-rank matrix in full precision is added and operates on the unquantized activations.
  • Figure 3: We show the effect of applying LRC with two quantization schemes, namely GPTQ and RTN, on the performance of Phi-3 on lm-eval tasks at W4A4.
  • Figure 4: We show the effect of the rank, chosen as a percentage of the original weight matrices, on the performance of Llama-3 (8B) on lm-eval tasks when quantized at W4A4. We also show the effect of groupsizing activations. As baselines (dashed lines), we plot the performance of QuaRot with and without groupsizing, as well as the performance of the original model.
  • Figure : Phi-3
  • Figure : Phi-3
  • ...and 1 more figure

Theorems & Definitions (6)

  • Proposition 3.1
  • Remark 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Remark 3.5
  • Remark B.1