
AffineQuant: Affine Transformation Quantization for Large Language Models

Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji

TL;DR

AffineQuant introduces an invertible affine-transform-based PTQ for large language models, expanding the optimization space beyond simple scaling to minimize quantization error. A Gradual Mask, grounded in the Levy-Desplanques theorem, preserves diagonal dominance to guarantee invertibility during training. Empirical results show state-of-the-art PTQ performance across multiple models and bit-quantization regimes, notably improving small models and low-bit settings without additional inference overhead. The approach unifies and extends prior equivalent-quantization methods, offering practical gains for edge deployment of LLMs.

Abstract

The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in techniques for compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has attracted particular attention because of its compression efficiency and low training cost. Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. In this paper, we advocate direct optimization using equivalent affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly reduces quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. To illustrate, we attain a C4 perplexity of 15.76 (2.26 lower than the 18.02 of OmniQuant) on the LLaMA2-7B model under W4A4 quantization without overhead. On zero-shot tasks, AffineQuant achieves an average accuracy of 58.61 (1.98 higher than the 56.63 of OmniQuant) when using 4/4-bit quantization for LLaMA-30B, setting a new state-of-the-art benchmark for PTQ in LLMs.
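
The equivalence claim in the abstract can be checked on a toy example. The sketch below is a minimal illustration, not the authors' implementation: it assumes a linear layer $Y = XW$ and a hypothetical $2 \times 2$ invertible matrix `A`, with $A^{-1}$ folded into the activations and $A$ into the weights. The helper names (`matmul`, `inverse_2x2`, `to_fractions`) and all numeric values are made up for illustration. Exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix; raises if the matrix is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def to_fractions(m):
    """Convert integer entries to exact rationals."""
    return [[Fraction(v) for v in row] for row in m]


# Illustrative values only (assumptions, not taken from the paper).
X = to_fractions([[1, 2], [3, 4]])  # activations: 2 tokens x 2 input channels
W = to_fractions([[5, 6], [7, 8]])  # weights: 2 input x 2 output channels
A = to_fractions([[2, 1], [1, 3]])  # hypothetical invertible transform

original = matmul(X, W)
# Fold A^{-1} into the activations and A into the weights.
transformed = matmul(matmul(X, inverse_2x2(A)), matmul(A, W))

outputs_match = original == transformed
```

In the method described above, quantization would be applied to the transformed weights. The sketch only shows that the transformation itself leaves the full-precision output unchanged, which is why `A` can be optimized freely as long as it stays invertible.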

Paper Structure (19 sections, 1 theorem, 13 equations, 7 figures, 11 tables)


Key Result

Theorem 1

When the stability factor $\alpha$ is small enough, if $N_{e}$ is strictly diagonally dominant, then $N_{e+1}$ is strictly diagonally dominant.
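
The role of a small $\alpha$ can be illustrated numerically. The sketch below is a toy check under stated assumptions, not the paper's proof or update rule: it assumes the next iterate has the form $N_{e+1} = N_e + \alpha \Delta$ for some hypothetical off-diagonal update $\Delta$, and the function names and matrices are invented for illustration. By the Levy-Desplanques theorem, a strictly diagonally dominant matrix is invertible, so preserving this property preserves invertibility.

```python
def is_strictly_diagonally_dominant(m):
    """True if |m[i][i]| > sum of |m[i][j]| over j != i, for every row i."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(m)
    )


def perturb(m, delta, alpha):
    """Return m + alpha * delta (hypothetical update form, an assumption)."""
    n = len(m)
    return [[m[i][j] + alpha * delta[i][j] for j in range(n)] for i in range(n)]


# Illustrative matrices only (assumptions, not taken from the paper).
N_e = [
    [2.0, 0.5, 0.0],
    [0.3, 1.5, 0.4],
    [0.0, 0.2, 1.0],
]
delta = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

small_step = perturb(N_e, delta, alpha=0.1)  # stays strictly diagonally dominant
large_step = perturb(N_e, delta, alpha=1.0)  # row 0: 2.0 is not > 1.5 + 1.0
```

With these values, `is_strictly_diagonally_dominant` holds for `N_e` and `small_step` but fails for `large_step`, matching the intuition that the off-diagonal growth per step must be kept small.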

Figures (7)

  • Figure 1: The effect of scaling, translation and affine transformation on the quantization of the weights. The term "Fixed Point" refers to the $2^n-1$ quantization levels in $n$-bit quantization. $s$, $b$, and $A$ are the scaling factor, translation factor, and affine transformation matrix, respectively. We assume that $W$ has $2$ input channels and $2$ output channels, and we treat each output channel as a two-dimensional vector.
  • Figure 2: The gradual mask operates on the affine transformation matrix, gradually incorporating the elements of matrix $A$ near the diagonal into the training process as training progresses.
  • Figure 3: Mean squared error loss of the last transformer block of LLaMA-7b and OPT-1.3b. "w2a16" means $2$-bit weight-only quantization. "w3a16g128" means $3$-bit weight-only quantization with a group size of $128$. We optimize the last block of LLaMA-7b and OPT-1.3b for $40$ and $20$ epochs, respectively.
  • Figure 4: PPL vs. weight-memory Pareto-optimal curves for LLaMA1&2 models of different sizes in the $4$/$4$ bit quantization configuration on C4 and WikiText2.
  • Figure 5: The relationship between WikiText2 PPL and the quantization loss of the last transformer block on LLaMA-7B and OPT-6.7B with 4/4-bit quantization.
  • ...and 2 more figures
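
The gradual mask described in the Figure 2 caption can be sketched as a band around the diagonal that widens over training. This is a hedged illustration only: the linear schedule for the band width, the use of the stability factor `alpha` as the off-diagonal mask value, and the function name `gradual_mask` are all assumptions made for this example and are not confirmed by the text above.

```python
def gradual_mask(size, epoch, total_epochs, alpha):
    """Build a size x size mask for a hypothetical gradual-mask schedule.

    Diagonal entries are always 1.0. Off-diagonal entries within the current
    band get `alpha`; entries outside the band are 0.0 (frozen). The band
    grows linearly with the epoch, which is an assumed schedule.
    """
    band = (epoch * size) // total_epochs
    mask = []
    for i in range(size):
        row = []
        for j in range(size):
            if i == j:
                row.append(1.0)
            elif abs(i - j) <= band:
                row.append(alpha)
            else:
                row.append(0.0)
        mask.append(row)
    return mask


# At epoch 0 only the diagonal is active; later epochs admit nearby elements.
start_mask = gradual_mask(size=4, epoch=0, total_epochs=4, alpha=0.1)
mid_mask = gradual_mask(size=4, epoch=2, total_epochs=4, alpha=0.1)
```

In a training loop, such a mask would typically be multiplied elementwise with the transformation matrix (or its gradient) so that elements far from the diagonal only start to change late in optimization.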

Theorems & Definitions (3)

  • Definition 1
  • Theorem 1
  • Proof 1