Low-Rank Correction for Quantized LLMs

Meyer Scetbon; James Hensman

TL;DR

This paper tackles post-training quantization of large language models by jointly quantizing weights and activations to 4 bits ($W4A4$) while correcting activation-quantization errors with full-precision low-rank terms that act on the unquantized activations. It introduces Low-Rank Correction (LRC), a framework that alternates between updating the quantized weights and fitting a low-rank correction, with an initialization and numerical-stability scheme designed for LLMs. With ranks equivalent to $10\%$ of the original weight matrix size, the method reduces the accuracy gap to the original model by more than $50\%$; with ranks equivalent to $30\%$, it closes the gap completely. Results are reported on Llama-2, Llama-3, Phi-3, and Mixtral, and the method remains compatible with QuaRot and GPTQ. This approach enables effective 4-bit quantization of both weights and activations, providing practical memory and compute benefits for deploying large models on constrained hardware. The work highlights activation quantization as a primary source of error under aggressive quantization and offers a scalable, calibration-data-driven route to mitigate it.
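
As a rough illustration of the joint problem summarized above, one way to write it for a single linear layer is given below. The notation is an assumption made here, not taken from this page: $W$ is the original weight matrix, $X$ the calibration activations, $Q_a$ the 4-bit activation quantizer, $\widehat{W}$ the quantized weight, and $U, V$ the full-precision low-rank factors of rank $k$.

$$
\min_{\widehat{W},\, U,\, V} \; \bigl\| W X - \widehat{W}\, Q_a(X) - U V^\top X \bigr\|_F^2
\quad \text{subject to } \widehat{W} \text{ lying on the 4-bit grid}, \;\; U \in \mathbb{R}^{d_{\text{out}} \times k}, \; V \in \mathbb{R}^{d_{\text{in}} \times k}.
$$

Under this reading, an alternating scheme would hold $U, V$ fixed while updating $\widehat{W}$, then hold $\widehat{W}$ fixed while refitting $U, V$, and repeat.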

Abstract

We consider the problem of model compression for Large Language Models (LLMs) at post-training time, where the task is to compress a well-trained model using only a small set of calibration input data. In this work, we introduce a new low-rank approach to correct for quantization errors of activations in LLMs: we propose to add low-rank weight matrices in full precision that act on the unquantized activations. We then solve a joint optimization problem over the quantized representation of the weights and additional low-rank weight matrices to quantize both weights and activations. We focus on the case of 4-bit weight-and-activation quantization (W4A4). Using ranks equivalent to 10% of the original weight matrix size, our approach reduces the accuracy gap with the original model by more than 50%. Using ranks equivalent to 30% of the original weight matrix, the accuracy gap is closed completely. We demonstrate our results on four recent LLMs, namely Llama-2, Llama-3, Phi-3 and Mixtral models.
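
To make the idea of a full-precision low-rank term acting on unquantized activations concrete, here is a minimal NumPy sketch for one linear layer. It is not the authors' algorithm. It uses a plain round-to-nearest quantizer in place of GPTQ or QuaRot, and it fits the low-rank factors in a single step by reduced-rank regression instead of alternating with the weight update. The function names (`fake_quant`, `fit_low_rank_correction`, `corrected_forward`) and the scaling choices are all hypothetical.

```python
import numpy as np

def fake_quant(t, bits=4, axis=None):
    """Symmetric round-to-nearest quantize-dequantize (illustrative stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

def fit_low_rank_correction(W, X, rank, bits=4):
    """Fit U, V so that W_q @ Q(X) + U @ V.T @ X approximates W @ X.

    W: (d_out, d_in) weights, X: (d_in, n_tokens) calibration activations.
    Assumption: the quantized weight is held fixed (no alternation here).
    """
    W_q = fake_quant(W, bits, axis=1)  # one scale per output channel
    X_q = fake_quant(X, bits, axis=0)  # one scale per token (column)
    R = W @ X - W_q @ X_q              # error left after W4A4 quantization

    # Unconstrained least squares: find B with B @ X close to R.
    B_ols = np.linalg.lstsq(X.T, R.T, rcond=None)[0].T  # (d_out, d_in)

    # Reduced-rank regression: project B_ols onto the top-`rank`
    # left singular vectors of the fitted values B_ols @ X.
    left, _, _ = np.linalg.svd(B_ols @ X, full_matrices=False)
    U = left[:, :rank]   # (d_out, rank)
    V = B_ols.T @ U      # (d_in, rank), so U @ V.T == U @ U.T @ B_ols
    return W_q, U, V

def corrected_forward(W_q, U, V, X, bits=4):
    """Quantized path on quantized inputs, low-rank path on raw inputs."""
    return W_q @ fake_quant(X, bits, axis=0) + U @ (V.T @ X)
```

If "ranks equivalent to 10% of the original weight matrix size" is read as a parameter budget (an assumption), the rank `k` for a `d_out` by `d_in` layer would satisfy `k * (d_in + d_out) ≈ 0.1 * d_in * d_out`.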
