LQER: Low-Rank Quantization Error Reconstruction for LLMs
Cheng Zhang, Jianyi Cheng, George A. Constantinides, Yiren Zhao
TL;DR
LQER is a post-training quantization framework that reconstructs the quantization error with a high-precision, low-rank correction term. An activation-derived scale matrix S shapes the singular-value spectrum of the error so that a lower correction rank suffices. This enables nearly lossless W4A8 quantization on diverse LLMs without distillation or iterative optimization, while keeping a regular computation pattern that is favorable for hardware. The key contributions are the SVD-based error reconstruction and the activation-aware scaling S, which together reduce the required correction rank and preserve model capability across benchmarks and model families. Empirically, LQER achieves competitive perplexity and downstream task accuracy at lower hardware cost (the abstract reports 1.36$\times$ fewer hardware resources than the leading state-of-the-art method). The calibration and quantization workflow scales to large models, and the framework is open source.
Abstract
Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of quantization error towards a desirable distribution, which enables nearly lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-based iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer
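The idea described above (quantize the weights, then approximate the activation-scaled quantization error with a truncated SVD) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the released framework: the helper names `quantize_rtn` and `lqer_decompose`, the per-tensor round-to-nearest quantizer, the choice of mean absolute activation per input channel as the diagonal of S, and the toy shapes. Activation quantization (the A8 part) is omitted.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantizer (a stand-in, not the paper's format)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def lqer_decompose(w, s_diag, rank, bits=4):
    """Return (w_q, a_k, b_k) such that w is approximated by w_q + a_k @ b_k."""
    w_q = quantize_rtn(w, bits)
    e_q = w - w_q                                  # quantization error
    # Scale each input-channel row of the error by the diagonal of S, then take the SVD.
    u, sigma, vt = np.linalg.svd(s_diag[:, None] * e_q, full_matrices=False)
    a_k = u[:, :rank] / s_diag[:, None]            # S^{-1} U_k, kept in high precision
    b_k = sigma[:rank, None] * vt[:rank]           # Sigma_k V_k^T, kept in high precision
    return w_q, a_k, b_k

rng = np.random.default_rng(0)
# Toy calibration activations whose input channels have uneven magnitudes.
x = rng.normal(size=(64, 32)) * rng.uniform(0.1, 4.0, size=32)
w = rng.normal(size=(32, 48))
s_diag = np.abs(x).mean(axis=0)                    # assumed definition of the scale S

w_q, a_k, b_k = lqer_decompose(w, s_diag, rank=8)

y_ref = x @ w
err_quant = np.linalg.norm(y_ref - x @ w_q)
# Two regular dense matmuls: the low-precision path plus the low-rank correction.
err_lqer = np.linalg.norm(y_ref - (x @ w_q + (x @ a_k) @ b_k))
print(f"output error, quantized only: {err_quant:.4f}")
print(f"output error, with low-rank correction: {err_lqer:.4f}")
```

In this sketch the correction is computed as `(x @ a_k) @ b_k`, so the high-precision part stays a pair of small dense matrix products running alongside `x @ w_q`, with no gathering of individual weights from irregular locations. At full rank the decomposition reproduces `w` exactly; the practical question is how small `rank` can be, which is what the scaling by S is meant to improve.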
