LQER: Low-Rank Quantization Error Reconstruction for LLMs

Cheng Zhang, Jianyi Cheng, George A. Constantinides, Yiren Zhao

TL;DR

LQER is a post-training quantization framework that reconstructs the quantization error with a high-precision, low-rank correction term, guided by an activation-derived scale that shapes the error's singular-value spectrum. This enables nearly lossless W4A8 quantization on diverse LLMs without distillation or iterative optimization, while keeping a regular computation pattern that is favorable for hardware. The key contributions are the SVD-based error reconstruction strategy and the activation-aware scale matrix S, which together reduce the required correction rank and preserve model capability across benchmarks and model families. Empirically, LQER achieves competitive perplexity and downstream task accuracy while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method, and its calibration and quantization workflow scales to large models. The framework is open-source.

Abstract

Post-training quantization of Large Language Models (LLMs) is challenging. In this work, we introduce Low-rank Quantization Error Reduction (LQER), which combines quantization and low-rank approximation to recover the model capability. LQER leverages an activation-induced scale matrix to drive the singular value distribution of the quantization error towards a desirable distribution, which enables nearly lossless W4A8 quantization on various LLMs and downstream tasks without the need for knowledge distillation, grid search, or gradient-based iterative optimization. Unlike existing methods, the computation pattern of LQER eliminates the need for specialized Scatter and Gather processes to collect high-precision weights from irregular memory locations. Our W4A8 LLMs achieve near-lossless performance on six popular downstream tasks, while using 1.36$\times$ fewer hardware resources than the leading state-of-the-art method. We open-source our framework at https://github.com/ChengZhang-98/lqer
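
The decomposition described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `quantize_int` and `lqer`, the symmetric per-tensor quantizer, the convention `y = x @ w`, and the choice of the mean absolute activation per input channel as the diagonal of S are all assumptions made for the example.

```python
import numpy as np

def quantize_int(w, bits=3):
    """Symmetric per-tensor fixed-point quantization (a simple stand-in quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def lqer(w, bits=3, rank=8, s=None):
    """Return (w_q, a_k, b_k) such that w is approximated by w_q + a_k @ b_k.

    w has shape (in_features, out_features). `s` is an optional vector of
    positive per-input-channel scales, i.e. the diagonal of S (hypothetical
    parameter name). With s=None this is a plain truncated SVD of the error.
    """
    w_q = quantize_int(w, bits)
    e_q = w - w_q                          # quantization error E_q = W - W_q
    if s is not None:
        e_q = s[:, None] * e_q             # left-multiply by S
    u, sigma, vt = np.linalg.svd(e_q, full_matrices=False)
    a_k = u[:, :rank]                      # keep the top-`rank` components
    if s is not None:
        a_k = a_k / s[:, None]             # undo the scaling: S^{-1} U_k
    b_k = sigma[:rank, None] * vt[:rank]   # Sigma_k V_k^T
    return w_q, a_k, b_k

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))
# Calibration activations with uneven per-channel magnitudes.
x = rng.standard_normal((256, 64)) * rng.uniform(0.1, 10.0, size=64)
s = np.abs(x).mean(axis=0)                 # illustrative diagonal of S

w_q, a_k, b_k = lqer(w, bits=3, rank=8, s=s)
# Regular computation pattern: one low-precision matmul plus two thin matmuls.
y_approx = x @ w_q + (x @ a_k) @ b_k
```

Both terms of `y_approx` are dense matrix products over contiguous operands, which is the regular pattern the abstract contrasts with Scatter and Gather of individual high-precision weights.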

Paper Structure (31 sections, 15 equations, 4 figures, 21 tables)

Figures (4)

  • Figure 1: Motivation and computation pattern of LQER. (a) We apply SVD to the quantization error $E_q=W-W_q$ of a 3-bit fixed-point quantized weight in OPT-1.3B and plot the singular value distributions. Distributions are normalized to have the same Frobenius norm for a fair comparison. Curves with a steeper, more asymptotic decay suggest better suitability for low-rank approximation. L$^2$QER (the activation-scaled variant) displays a much steeper distribution with a smaller number of dominant singular values. (b) LQER approximates a trained weight $W$ with two high-precision yet low-rank matrices $A_k$ and $B_k$, and a low-precision yet high-rank matrix $W_q$. Both components are inexpensive to compute. This establishes a regular computation pattern that eliminates the need for irregular memory access such as the Scatter and Gather operations in LLM.int8().
  • Figure 2: MXINT number format [rouhani2023microscaling]. MXINT places a shared exponent across a group of fixed-point numbers. MXINT is more hardware-efficient than floating point because of its simplified vector inner product, and provides a larger dynamic range than fixed-point numbers. MXINT has recently been standardized for next-generation AI hardware systems [mxspecs2023].
  • Figure 3: Perplexity ($\downarrow$) vs. rank. We apply W3A8 LQER and L$^2$QER to OPT-1.3B and plot the resulting perplexity. Given that the embedding dimension is 2048, LQER requires a fairly large $k\approx 600$ to reach a perplexity close to FP16. In comparison, a small $k\approx 64$ is enough for L$^2$QER. Comparison of perplexity ($\downarrow$) and quantization error reconstruction between LQER and L$^2$QER.
  • Figure 4: Approximation error of LQER and L$^2$QER across decoder layers in LLaMA-7B. L$^2$QER produces smaller approximation errors on most of the linear layers in transformer-based LLMs. However, a few layers are better reconstructed by LQER, such as the key, value, and output projection layers in the 1st, 3rd, and 4th decoder layers. The derivation of $S$ is worth further exploration.
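
The shared-exponent idea in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions rather than the standardized MX encoding: the function name `mxint_quantize`, the default block size of 16, the 4-bit mantissa, and the rule for picking the exponent are hypothetical choices, and details such as exponent bit width and bias are ignored.

```python
import numpy as np

def mxint_quantize(x, mantissa_bits=4, block_size=16):
    """Block-wise quantization with one shared power-of-two exponent per block.

    Returns (dequantized values, integer mantissas, per-block exponents).
    Simplified sketch; not the exact MX specification.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.size % block_size != 0:
        raise ValueError("x.size must be a multiple of block_size")
    blocks = x.reshape(-1, block_size)

    # Shared exponent: smallest e with 2**e >= max |value| in the block.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)   # avoid log2(0) for all-zero blocks
    exp = np.ceil(np.log2(safe))

    # Each element is a signed fixed-point mantissa in units of `step`.
    qmax = 2 ** (mantissa_bits - 1) - 1
    step = np.exp2(exp) / 2 ** (mantissa_bits - 1)
    mant = np.clip(np.round(blocks / step), -qmax - 1, qmax)

    dequantized = (mant * step).reshape(x.shape)
    return dequantized, mant.astype(np.int32), exp.astype(np.int32)

# Three blocks with very different magnitudes, for illustration only.
values = np.concatenate([
    np.linspace(-0.01, 0.01, 16),
    np.linspace(-100.0, 100.0, 16),
    np.zeros(16),
])
deq, mant, exp = mxint_quantize(values, mantissa_bits=4, block_size=16)
```

Because every block carries its own exponent, the small-magnitude block keeps useful resolution even though another block contains values four orders of magnitude larger, while the inner product within a block only needs integer multiply-accumulates on the mantissas.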