ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang

TL;DR

ResQ tackles the challenge of post-training quantization for large language models by enabling aggressive mixed-precision quantization across weights, activations, and KV caches. It uses PCA to identify a low-rank subspace that captures the majority of activation variance and preserves that subspace in high precision ($8$-bit), while quantizing the remaining components to $4$-bit; invariant random rotations are applied within each subspace to suppress outliers. The scheme is proven to minimize quantization error and is designed to be accelerator-friendly by fusing projections into weights and using efficient structures (e.g., Hadamard for $\boldsymbol{U}_D$). Empirical results on Llama and Qwen models show ResQ surpasses state-of-the-art PTQ methods in perplexity and accuracy across language modeling, reasoning, and multi-modal tasks, with substantial runtime speedups on GPUs. This work enables practical, low-cost deployment of LLMs in resource-constrained settings and provides a CUDA-accelerated pipeline for mixed-precision quantization.

Abstract

Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantization of all weight, activation and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next-best method, SpinQuant, and up to $3\times$ speedup over the 16-bit baseline. Code is available at https://github.com/utkarsh-dmx/project-resq.
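The pipeline described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the synthetic activations, the use of an uncentered second-moment matrix for PCA, and per-token symmetric quantization are all assumptions made for illustration, and quantization is simulated in floating point rather than run on integer kernels. Only the rank of 1/8 of the hidden dimension and the 8-bit/4-bit split come from the text.

```python
import numpy as np

def quantize(x, bits):
    """Simulated symmetric per-row (per-token) quantization; returns dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix keeps q orthogonal

def pca_mixed_precision_basis(calib, rank, rng):
    """Return (u_low, u_high); u_high spans the top-`rank` PCA directions.

    Assumption: PCA is taken on the uncentered second-moment matrix of the
    calibration activations, and each subspace gets its own random rotation.
    """
    d = calib.shape[1]
    _, eigvecs = np.linalg.eigh(calib.T @ calib)  # eigenvalues in ascending order
    u_low = eigvecs[:, : d - rank] @ random_orthogonal(d - rank, rng)
    u_high = eigvecs[:, d - rank :] @ random_orthogonal(rank, rng)
    return u_low, u_high

def mixed_precision_quantize(x, u_low, u_high, low_bits=4, high_bits=8):
    """Project, quantize each subspace at its own precision, project back."""
    lo = quantize(x @ u_low, low_bits)
    hi = quantize(x @ u_high, high_bits)
    return lo @ u_low.T + hi @ u_high.T

rng = np.random.default_rng(0)
d, n = 64, 2048
x = rng.standard_normal((n, d))
x[:, :4] *= 30.0  # hypothetical outlier channels with very high variance

calib, x_test = x[:1024], x[1024:]
u_low, u_high = pca_mixed_precision_basis(calib, rank=d // 8, rng=rng)

err_mixed = np.linalg.norm(x_test - mixed_precision_quantize(x_test, u_low, u_high))
err_uniform = np.linalg.norm(x_test - quantize(x_test, 4))
print("mixed-precision error:", err_mixed)
print("uniform 4-bit error:  ", err_uniform)
```

On this synthetic data the high-variance directions land in the 8-bit subspace, so the low-precision part no longer has to share its quantization range with the outliers.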

Paper Structure

This paper contains 23 sections, 3 theorems, 12 equations, 7 figures, 10 tables.

Key Result

Lemma 4.1

By the Central Limit Theorem, the distribution after multiplication with a random orthogonal matrix is approximately Gaussian (Tseng et al., 2024).
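The effect stated in the lemma can be checked numerically. The sketch below is an illustration under assumptions, not part of the paper: it builds a heavy-tailed vector with a few hypothetical outlier coordinates, multiplies it by a random orthogonal matrix drawn via QR of a Gaussian matrix, and compares the excess kurtosis (which is 0 for a Gaussian) before and after.

```python
import numpy as np

def excess_kurtosis(v):
    """Sample excess kurtosis; approximately 0 for Gaussian-distributed entries."""
    c = v - v.mean()
    return float((c ** 4).mean() / (c ** 2).mean() ** 2 - 3.0)

rng = np.random.default_rng(0)
d = 1024
x = rng.standard_normal(d)
x[:8] *= 50.0  # hypothetical outlier coordinates

# Random orthogonal matrix from QR of a Gaussian matrix (sign-corrected).
q, r = np.linalg.qr(rng.standard_normal((d, d)))
q = q * np.sign(np.diag(r))

y = q @ x  # rotated vector: same norm, energy spread over all coordinates

kurt_before = excess_kurtosis(x)
kurt_after = excess_kurtosis(y)
print("excess kurtosis before rotation:", kurt_before)
print("excess kurtosis after rotation: ", kurt_after)
print("max |x| / max |y|:", np.abs(x).max() / np.abs(y).max())
```

The rotation preserves the vector norm but shrinks the largest entry, which is what makes a uniform quantization grid a better fit after rotation.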

Figures (7)

  • Figure 1: (a)-(c) Different approaches to quantization including ResQ. Symbol sizes represent magnitudes of values and colors indicate precisions of quantization (blue: low precision, orange: high precision). (d)-(e) Quantization SNR comparison of ResQ with other baselines.
  • Figure 2: Matrix multiplication with mixed precision operands
  • Figure 3: Activation distribution of the baseline and after applying the projection matrices.
  • Figure 4: Model inference with ResQ incorporating the projection matrices. (a) $\boldsymbol{U}_A$ modifies the inputs across blocks, enabling better quantization. (b) $\boldsymbol{U}_B$ and $\boldsymbol{U}_C$ enable mixed-precision quantization of the KV cache. (c) $\boldsymbol{U}_D$ projects the activations and weights of the down_proj layer.
  • Figure 5: Speedup of ResQ and INT4 kernel on single decoder block on NVIDIA RTX 3090 over 16-bit floating point baseline for batch size of 1.
  • ...and 2 more figures
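
The mixed-precision matrix multiplication of Figure 2 and the weight fusion mentioned in the TL;DR rest on a simple identity: for an orthogonal basis split as $[\boldsymbol{U}_l, \boldsymbol{U}_h]$, the product $XW$ equals $(X\boldsymbol{U}_l)(\boldsymbol{U}_l^\top W) + (X\boldsymbol{U}_h)(\boldsymbol{U}_h^\top W)$, so the projected weights can be precomputed offline. The sketch below is a hedged illustration: the names `u_low`, `u_high`, the shapes, the random basis (standing in for a PCA-derived one), and the choice to quantize weights at the same bit-width as the matching activations are all assumptions, and quantization is simulated in floating point.

```python
import numpy as np

def quantize(t, bits, axis):
    """Simulated symmetric quantization with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
n, d, m, r = 16, 32, 24, 4  # tokens, hidden dim, output dim, high-precision rank

x = rng.standard_normal((n, d))
w = rng.standard_normal((d, m))

# Stand-in orthogonal basis; in ResQ this would come from PCA plus rotations.
u, _ = np.linalg.qr(rng.standard_normal((d, d)))
u_low, u_high = u[:, : d - r], u[:, d - r :]

# Offline: fold the projections into the weights.
w_low, w_high = u_low.T @ w, u_high.T @ w

# Without quantization the split product reproduces x @ w exactly.
y_exact = (x @ u_low) @ w_low + (x @ u_high) @ w_high

# With quantization, each partial product uses its own precision:
# per-token scales for activations (axis=1), per-output-column scales for weights (axis=0).
y_quant = (
    quantize(x @ u_low, 4, axis=1) @ quantize(w_low, 4, axis=0)
    + quantize(x @ u_high, 8, axis=1) @ quantize(w_high, 8, axis=0)
)
print("relative error of quantized product:",
      np.linalg.norm(y_quant - x @ w) / np.linalg.norm(x @ w))
```

Because the two partial products are independent, a low-bit GEMM and a small high-bit GEMM can be run separately and their outputs summed.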

Theorems & Definitions (3)

  • Lemma 4.1
  • Theorem 4.2
  • Lemma 1.1