MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization

Aozhong Zhang; Naigang Wang; Yanxia Deng; Xin Li; Zi Yang; Penghang Yin

TL;DR

Weight Magnitude Reduction (MagR) is a simple optimization-based preprocessing technique that improves the performance of post-training quantization. Because MagR acts as a non-linear transformation of the weights, it needs no additional post-processing and adds no inference overhead.

Abstract

In this paper, we present a simple optimization-based preprocessing technique called Weight Magnitude Reduction (MagR) to improve the performance of post-training quantization. For each linear layer, we adjust the pre-trained floating-point weights by solving an $\ell_\infty$-regularized optimization problem. This process greatly diminishes the maximum magnitude of the weights and smooths out outliers, while preserving the layer's output. The preprocessed weights are centered more towards zero, which facilitates the subsequent quantization process. To implement MagR, we address the $\ell_\infty$-regularization by employing an efficient proximal gradient descent algorithm. Unlike existing preprocessing methods that involve linear transformations and subsequent post-processing steps, which can introduce significant overhead at inference time, MagR functions as a non-linear transformation, eliminating the need for any additional post-processing. This ensures that MagR introduces no overhead whatsoever during inference. Our experiments demonstrate that MagR achieves state-of-the-art performance on the Llama family of models. For example, we achieve a Wikitext2 perplexity of 5.95 on the LLaMA2-70B model for per-channel INT2 weight quantization without incurring any inference overhead.
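The preprocessing step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the objective $\tfrac{1}{2}\|Xw - Xw_0\|_2^2 + \alpha\|w\|_\infty$ for a single weight column, the names `magr_column`, `prox_linf`, and `project_l1_ball`, and the parameters `alpha` and `n_iter` are all assumptions made for illustration. The sketch relies on the standard fact that the proximal operator of the $\ell_\infty$ norm can be computed from a projection onto the $\ell_1$-ball (Moreau decomposition).

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius} (sort-based)."""
    if radius <= 0:
        return np.zeros_like(v)
    abs_v = np.abs(v)
    if abs_v.sum() <= radius:
        return v.copy()  # already inside the ball
    u = np.sort(abs_v)[::-1]              # magnitudes in descending order
    cssv = np.cumsum(u) - radius
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > cssv)[0][-1]  # last index kept by the soft-threshold
    theta = cssv[rho] / (rho + 1.0)
    return np.sign(v) * np.maximum(abs_v - theta, 0.0)

def prox_linf(v, t):
    """prox of t * ||.||_inf, via Moreau decomposition: v - proj onto l1-ball of radius t."""
    return v - project_l1_ball(v, t)

def magr_column(X, w0, alpha, n_iter=200):
    """Proximal gradient descent on the ASSUMED objective
    0.5 * ||X w - X w0||^2 + alpha * ||w||_inf for one weight column w0."""
    H = X.T @ X
    L = np.linalg.eigvalsh(H)[-1]  # Lipschitz constant of the smooth term's gradient
    step = 1.0 / L
    w = w0.copy()
    for _ in range(n_iter):
        grad = H @ (w - w0)                               # gradient of the fidelity term
        w = prox_linf(w - step * grad, step * alpha)      # shrink the largest magnitudes
    return w
```

With an approximately rank-deficient feature matrix `X`, the iterate can move along directions that barely change `X @ w`, so the largest entries of the column shrink while the layer output stays close to `X @ w0`. The result is still an ordinary weight vector, which is why no extra step is needed at inference time.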

Paper Structure

This paper contains 15 sections, 17 equations, 2 figures, 9 tables, and 3 algorithms.

Introduction
Related Work
Background
The Proposed Method
Approximately Rank-Deficient Feature Matrix
MagR via $\ell_{\infty}$-Regularization
Experiments
Language Generation
Zero-Shot Tasks
Preprocessing and Quantization Runtime
Concluding Remarks
Appendix / supplemental material
Projection of Vectors Onto $\ell_1$-Ball
Additional Experimental Results
Ablation Study

Figures (2)

Figure 1: Motivation behind MagR: we can effectively reduce the magnitude of weights at the preprocessing stage. Each point denotes the maximum magnitude before ($x$-coordinate) and after ($y$-coordinate) applying MagR within a sampled channel (or column) of the weight matrix from three random layers of LLaMA2-7B (Touvron et al., 2023). These column-wise maximum magnitudes are typically more than halved through MagR.
Figure 2: Layer-wise quantization errors (root MSE) of MagR+OPTQ and OPTQ for 4-bit quantization. The layers are selected randomly for visualization, but the improvement is consistent across all layers.
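To make the link between column-wise maximum magnitude and quantization error concrete, here is a small sketch of per-channel uniform quantization. It uses plain round-to-nearest with a min/max range per column as a stand-in; it is not OPTQ and not the quantizer used in the paper, and the function name `quantize_per_channel` is hypothetical.

```python
import numpy as np

def quantize_per_channel(W, bits):
    """Round-to-nearest uniform quantization with a separate min/max range per column.
    Returns the dequantized weights and the per-column step size."""
    levels = 2 ** bits - 1
    w_min = W.min(axis=0, keepdims=True)
    w_max = W.max(axis=0, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)          # guard against constant columns
    q = np.clip(np.round((W - w_min) / scale), 0, levels)  # integer codes in [0, levels]
    return q * scale + w_min, scale
```

The step size of each column is proportional to its range, and the rounding error of every entry is at most half a step. A single outlier therefore coarsens the grid for the whole column, which is why reducing the maximum magnitude before quantization helps most at low bit widths.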
