SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

Yeonsik Park, Hyeonseong Kim, Seungkyu Choi

TL;DR

SERQ is a saliency-aware error reconstruction method for low-bit LLM inference. It uses a single low-rank compensation matrix and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.

Abstract

Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during inference and thereby limiting low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating quantization errors arising from both activation and weight saliency through three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The method incurs additional computation only for low-rank error reconstruction via a single decomposition, while all other operations are performed offline, thereby keeping latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings, and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.
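
The idea of a single compensation matrix restricted to salient rows can be illustrated with a small NumPy sketch. This is not the authors' implementation: the saliency score (mean absolute activation per channel), the symmetric round-to-nearest `fake_quant` helper, the W4A8 bit widths, the 4-bit quantization of the compensation matrix, and all sizes are assumptions chosen for illustration, and the static activation flattening stage is omitted because the abstract does not specify it.

```python
import numpy as np

def fake_quant(t, bits, axis):
    """Symmetric round-to-nearest fake quantization, one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

def rel_err(y, y_ref):
    """Relative Frobenius-norm error against the full-precision output."""
    return np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, r = 64, 256, 128, 16  # r = number of salient rows (assumed)

# Synthetic activations with a few channel-wise outliers, and a weight matrix.
x = rng.normal(size=(n_tokens, d_in))
outlier_channels = rng.choice(d_in, size=8, replace=False)
x[:, outlier_channels] *= 20.0
w = rng.normal(size=(d_in, d_out)) * 0.05

# --- Offline calibration (assumed procedure) ---
saliency = np.abs(x).mean(axis=0)          # hypothetical per-channel saliency score
perm = np.argsort(-saliency)               # most salient input channels first
w_perm = w[perm, :]                        # weight row permutation
w_q = fake_quant(w_perm, bits=4, axis=0)   # 4-bit weights, one scale per output column
err_salient = (w_perm - w_q)[:r, :]        # single r x d_out compensation matrix
err_q = fake_quant(err_salient, bits=4, axis=0)  # keep the residual path low-bit too

# --- Inference ---
x_perm = x[:, perm]                        # in practice folded into the previous layer
x_q = fake_quant(x_perm, bits=8, axis=1)   # 8-bit activations, one scale per token
y_main = x_q @ w_q                         # main low-precision path
y_resid = x_q[:, :r] @ err_q               # residual path on salient channels only

y_ref = x @ w
err_before = rel_err(y_main, y_ref)
err_after = rel_err(y_main + y_resid, y_ref)
print(f"relative error without reconstruction: {err_before:.4f}")
print(f"relative error with reconstruction:    {err_after:.4f}")
```

Because the compensation is a single `r x d_out` matrix applied to the first `r` permuted channels, the residual path is one small matrix multiplication with no intermediate tensor between two factors, which is the property the abstract contrasts with two-factor low-rank adapters.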

Paper Structure

This paper contains 24 sections, 10 equations, 4 figures, and 13 tables.

Figures (4)

  • Figure 1: Computation flow of a linear layer under different matrix decomposition methods. LLM.int8() employs a mixed-precision scheme of INT8 and FP16 by assigning separate computation paths for outliers and non-outliers. L$^2$QER applies SVD-based error reconstruction with a mixed-precision path of INT4 and INT8. In contrast, the proposed SERQ leverages a saliency-guided low-rank matrix and provides a unified computation path with INT4 or MXFP4 precision.
  • Figure 2: (a) Overall SERQ implementation. During calibration, saliency rows are determined via activation scaling, followed by weight row permutation. During inference, error reconstruction is performed through a residual path computed only on the salient components, alongside the main path. (b) Computation flow of a decoder layer. The merged row- and column-wise weight permutation enables offline preprocessing of both current weight rows and subsequent activation channels.
  • Figure 3: GPU performance comparison. We report latency overhead across various matrix sizes (batch size 1, token length 4k). SERQ is particularly effective for matrices with larger row dimensions (see Appendix A.6).
  • Figure 4: The trade-off between loss from rank reduction and the coverage of error reconstruction. The figure shows that higher accuracy is achieved by reconstructing errors for salient rows with smaller ranks, rather than covering a larger portion of the weight matrix.
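
The offline permutation described in the Figure 2(b) caption relies on a simple identity: reordering the output columns of one layer and the input rows of the next layer by the same permutation leaves the result unchanged. The sketch below checks this on a toy two-layer stack. The layer shapes, the ReLU activation, and the random permutation (standing in for a saliency ordering) are assumptions, and real decoder layers also contain normalization and residual connections that are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_hidden, d_out = 4, 8, 16, 8

x = rng.normal(size=(n_tokens, d_in))
w_prev = rng.normal(size=(d_in, d_hidden))   # layer producing the activations
w_cur = rng.normal(size=(d_hidden, d_out))   # layer consuming the activations

def act(t):
    # Any elementwise function commutes with a channel permutation.
    return np.maximum(t, 0.0)

# Stand-in for a saliency ordering of the hidden channels (hypothetical).
perm = rng.permutation(d_hidden)

# Reference: no permutation anywhere.
y_ref = act(x @ w_prev) @ w_cur

# Offline: permute the previous layer's columns and the current layer's rows.
w_prev_perm = w_prev[:, perm]
w_cur_perm = w_cur[perm, :]

# Online: activations arrive already permuted, with no runtime gather.
h_perm = act(x @ w_prev_perm)
y_perm = h_perm @ w_cur_perm

print(np.allclose(y_ref, y_perm))
```

Since both reorderings are applied to stored weights, the permuted activation layout comes at no inference-time cost in this simplified setting.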