ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
TL;DR
ResQ tackles the challenge of post-training quantization for large language models by enabling aggressive mixed-precision quantization across weights, activations, and KV caches. It uses PCA to identify a low-rank subspace that captures the majority of activation variance and preserves that subspace in high precision ($8$-bit), while quantizing the remaining components to $4$-bit; invariant random rotations are applied within each subspace to suppress outliers. The approach is proven to minimize quantization error and is designed to be accelerator-friendly by fusing projections into weights and using efficient structures (e.g., Hadamard for $\boldsymbol{U}_D$). Empirical results on Llama and Qwen models show that ResQ surpasses state-of-the-art PTQ methods in perplexity and accuracy across language modeling, reasoning, and multi-modal tasks, with substantial runtime speedups on GPUs. This work enables practical, low-cost deployment of LLMs in resource-constrained settings and provides a CUDA-accelerated pipeline for mixed-precision quantization.
Abstract
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging because of the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that advances the state of the art. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33\% lower perplexity on Wikitext than the next best method, SpinQuant, and up to $3\times$ speedup over the 16-bit baseline. Code is available at https://github.com/utkarsh-dmx/project-resq.
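
The core idea of the abstract (PCA basis, 8-bit for the high-variance subspace, 4-bit for the rest, a random rotation inside each subspace) can be illustrated with a small NumPy sketch. This is not the authors' implementation, which lives in the repository linked above. The function names, the symmetric per-token quantizer, the uncentered covariance, the QR-based random rotations, and the synthetic activations with four outlier channels are all assumptions made for illustration, and weight and KV-cache quantization are left out.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric per-row (per-token) uniform quantization; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def subspace_bases(calib, rank, rng):
    """Split the PCA basis of calibration activations into (U_low, U_high).

    U_high spans the `rank` directions of highest variance. Each block is then
    rotated by a random orthogonal matrix that stays inside its own subspace.
    Assumption: the covariance is not mean-centered.
    """
    d = calib.shape[1]
    _, vecs = np.linalg.eigh(calib.T @ calib)  # eigenvalues in ascending order
    u_low, u_high = vecs[:, : d - rank], vecs[:, d - rank:]
    return u_low @ random_orthogonal(d - rank, rng), u_high @ random_orthogonal(rank, rng)

def mixed_precision_roundtrip(x, u_low, u_high):
    """Project, quantize (4-bit low-variance / 8-bit high-variance), project back."""
    return quantize(x @ u_low, 4) @ u_low.T + quantize(x @ u_high, 8) @ u_high.T

# Synthetic activations (hypothetical data): 64 channels, 4 of them outliers.
rng = np.random.default_rng(0)
d, rank = 64, 8  # rank = 1/8 of the hidden dimension, as in the abstract
x = rng.standard_normal((256, d))
x[:, :4] *= 50.0

u_low, u_high = subspace_bases(x, rank, rng)
basis = np.concatenate([u_low, u_high], axis=1)

err_plain = np.linalg.norm(x - quantize(x, 4))  # uniform 4-bit, no projection
err_mixed = np.linalg.norm(x - mixed_precision_roundtrip(x, u_low, u_high))
print(f"4-bit error: {err_plain:.2f}, mixed 4/8-bit error: {err_mixed:.2f}")
```

Because the concatenated basis is orthogonal, a linear layer's output is unchanged when the activations are multiplied by the basis and the weights by its transpose, which is what makes it possible to fuse the projection into the weights, as the TL;DR describes. The sketch only measures activation reconstruction error and does not model that fusion.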
