LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices

Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee

TL;DR

The paper tackles the accuracy degradation that post-training quantization (PTQ) causes in large language models under weight-activation quantization. It introduces Low-Rank Quantization (LRQ), which reconstructs Transformer block outputs using low-rank weight-scaling matrices in place of full weight-scaling matrices (one learnable scale per weight), reducing the number of learnable parameters while still scaling each weight individually. LRQ builds on block-wise reconstruction, is contrasted with FlexRound, and is validated under three schemes: 8-bit weight with per-tensor activation quantization, 4-bit weight with 8-bit per-token activation quantization, and low-bit weight-only quantization. Across these schemes it shows improved generalization on common sense reasoning (CSR) tasks and MMLU. The results indicate that LRQ provides robust quantization performance with practical benefits in memory and latency, offering a scalable PTQ pathway for deploying large language models at lower precision.

Abstract

With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing weights and activations of LLMs still suffer from non-negligible accuracy drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) - a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only needs to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. We show the superiority of LRQ over prior LLM PTQ works under (i) 8-bit weight and per-tensor activation quantization, (ii) 4-bit weight and 8-bit per-token activation quantization, and (iii) low-bit weight-only quantization schemes. Our code is available at Software.
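
The parameter-sharing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the divisor form `step * exp(L @ U)`, the symmetric per-channel step size, and the names `low_rank_scaled_quantize` and `count_scale_params` are all assumptions made for illustration. The sketch only shows how a rank-r product can supply one scale per weight while storing far fewer learnable values than a full scaling matrix.

```python
import numpy as np

def low_rank_scaled_quantize(W, L, U, n_bits=8):
    """Quantize W with a per-weight divisor built from a low-rank product.

    Assumption: divisor = step * exp(L @ U). The paper's exact
    parameterization may differ; this only illustrates the idea.
    """
    qmax = 2 ** (n_bits - 1) - 1
    # Symmetric per-output-channel step size (hypothetical choice for brevity).
    step = np.abs(W).max(axis=1, keepdims=True) / qmax
    # (c_out, rank) @ (rank, c_in) -> one scale per weight.
    scale = np.exp(L @ U)
    q = np.clip(np.round(W / (step * scale)), -qmax - 1, qmax)
    return q * step

def count_scale_params(c_out, c_in, rank):
    """Learnable scales: full matrix vs. low-rank factors."""
    full = c_out * c_in
    low_rank = rank * (c_out + c_in)
    return full, low_rank

rng = np.random.default_rng(0)
c_out, c_in, rank = 64, 128, 4
W = rng.normal(size=(c_out, c_in))
# With L = 0 the scale matrix is all ones, so this starts as plain rounding.
L = np.zeros((c_out, rank))
U = rng.normal(size=(rank, c_in)) * 0.01
W_hat = low_rank_scaled_quantize(W, L, U)
full, low_rank = count_scale_params(c_out, c_in, rank)
print(full, low_rank)
```

For this 64 x 128 layer with rank 4, the full scaling matrix would hold 8192 learnable scales, while the two low-rank factors hold 768. In an actual PTQ setup, `L` and `U` would be optimized to minimize a block-output reconstruction loss on calibration data; that training loop is omitted here.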

Paper Structure

This paper contains 29 sections, 6 equations, 9 figures, and 32 tables.

Figures (9)

  • Figure 1: (a) Zero-shot performance and (b) five-shot accuracy of Llama with $8$-bit per-channel asymmetric weight quantization and $8$-bit per-tensor asymmetric static activation quantization, while keeping the KV cache in FP16.
  • Figure 2: Zero-shot performance and five-shot accuracy of Llama $7$B for FlexRound (FR) on common sense reasoning (CSR) tasks and MMLU according to the calibration sample size, with $8$-bit per-channel asymmetric weight and $8$-bit per-tensor asymmetric static activation quantization, while keeping the KV cache in FP16.
  • Figure 3: Accumulated root mean square error (RMSE) between ${\bm{W}}{\bm{X}}$ and $\widehat{{\bm{W}}}\widetilde{{\bm{X}}}$ for RTN, FlexRound, and LRQ on (a) a calibration sample from the C4 dataset and (b) an unseen sample from common sense reasoning and MMLU benchmarks, ranging from the first Transformer block to the last Transformer block of Llama $7$B. Here, weights and activations are quantized to $8$-bit with per-channel asymmetric quantization and per-tensor asymmetric static quantization, while the KV cache remains in FP16. Note that RMSE tends to rise in line with the block index due to the presence of $\widetilde{{\bm{X}}}$ that accumulates quantization error resulting from previous quantized Transformer blocks.
  • Figure 4: Zero-shot and five-shot performance of Llama $7$B on common sense reasoning (CSR) tasks and MMLU, where weights and activations are quantized to $8$-bit while the KV cache is kept in FP16.
  • Figure 5: Average zero-shot accuracy versus latency for Llama 2 $7$B, $13$B, and $70$B. Blue circles denote FP16 baselines and red circles denote 4-bit quantized models obtained with LRQ. The size of each circle indicates the model size. More details are given in the Appendix.
  • ...and 4 more figures
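
The accumulated RMSE described in the Figure 3 caption can be sketched on a toy model. This is an illustration under stated assumptions, not the paper's evaluation code: a stack of plain linear layers with `tanh` stands in for Transformer blocks, only weights are quantized (with round-to-nearest, per-channel asymmetric), and the helper names `rtn_asymmetric_per_channel` and `accumulated_rmse` are hypothetical.

```python
import numpy as np

def rtn_asymmetric_per_channel(W, n_bits=8):
    """Round-to-nearest with one (scale, zero-point) pair per output channel."""
    qmax = 2 ** n_bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(W / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def accumulated_rmse(weights, X, n_bits=8):
    """RMSE between W X and W_hat X_tilde at each layer of a toy stack.

    X_tilde is the input produced by the previously quantized layers,
    so it carries their quantization error forward.
    """
    X_fp, X_q = X, X
    errors = []
    for W in weights:
        W_hat = rtn_asymmetric_per_channel(W, n_bits)
        Y_fp = W @ X_fp       # full-precision path
        Y_q = W_hat @ X_q     # quantized path with accumulated error
        errors.append(float(np.sqrt(np.mean((Y_fp - Y_q) ** 2))))
        # Toy nonlinearity standing in for the rest of a block (assumption).
        X_fp, X_q = np.tanh(Y_fp), np.tanh(Y_q)
    return errors

rng = np.random.default_rng(0)
toy_weights = [rng.normal(size=(16, 16)) / 4 for _ in range(3)]
X = rng.normal(size=(16, 8))
print(accumulated_rmse(toy_weights, X, n_bits=8))
print(accumulated_rmse(toy_weights, X, n_bits=4))
```

Running the same function on calibration-like and unseen inputs would mirror the comparison in panels (a) and (b); the toy stack makes no claim about how the curves look for a real model.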