FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization

Jung Hyun Lee; Jeonghoon Kim; Se Jung Kwon; Dongsoo Lee

TL;DR

FlexRound introduces a learnable, division-based rounding scheme for post-training quantization that jointly optimizes a common grid size $s_1$ and per-weight division factors, enabling flexible quantization that accounts for weight magnitudes. By formulating $\widehat{\boldsymbol{W}} = s_1 \lfloor \boldsymbol{W} / \boldsymbol{S} \rceil$ with $\boldsymbol{S}$ built from learnable components, and leveraging a gradient that scales with weight magnitude, FlexRound achieves superior reconstruction quality across vision and language models in a per-tensor uniform PTQ setting. Extensive ablations show the importance of learning $s_1$ and the added tensors ${\bm{s}}_3$, ${\bm{s}}_4$, while experiments on ResNet, MobileNetV2, BERT, GPT-Neo, OPT, GPT-2, and LLaMA demonstrate robust performance gains over AdaRound and AdaQuant, including successful quantization of large language models with minimal accuracy loss. The results underscore FlexRound's broad practical impact for deploying quantized models on resource-constrained devices and in settings with limited data or compute for PTQ.
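
The rounding rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the linear-layer composition ${\bm{S}} = s_1 \odot {\bm{S}}_2 \odot {\bm{s}}_3$ given in the Figure 2 caption, omits any clipping to the representable integer range, and uses hypothetical names (`flexround_quantize`, `round_half_up`) and made-up example values.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with ties going up."""
    return math.floor(x + 0.5)


def flexround_quantize(W, s1, S2, s3):
    """Return s1 * round(W / (s1 * S2 * s3)), applied element-wise.

    W, S2: lists of rows (out_features x in_features).
    s3:    one positive factor per output row (assumed broadcast over the row).
    No clipping is applied; that is a simplification of this sketch.
    """
    W_hat = []
    for row_w, row_s2, s3_i in zip(W, S2, s3):
        W_hat.append([
            s1 * round_half_up(w / (s1 * s2 * s3_i))
            for w, s2 in zip(row_w, row_s2)
        ])
    return W_hat


# Illustrative values only (not taken from the paper).
W = [[0.26, -0.74], [1.10, 0.05]]
ones = [[1.0, 1.0], [1.0, 1.0]]

# With all division factors equal to 1, the rule reduces to rounding-to-nearest.
rtn = flexround_quantize(W, 0.5, ones, [1.0, 1.0])
# rtn == [[0.5, -0.5], [1.0, 0.0]]

# Changing a single factor moves only that weight to a neighbouring grid point.
flex = flexround_quantize(W, 0.5, [[1.0, 1.0], [0.8, 1.0]], [1.0, 1.0])
# flex == [[0.5, -0.5], [1.5, 0.0]]
```

When every entry of ${\bm{S}}_2$ and ${\bm{s}}_3$ is 1, the sketch coincides with rounding-to-nearest on the grid $s_1$, so the learnable factors can be read as a perturbation of that baseline.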

Abstract

Post-training quantization (PTQ) has been gaining popularity for deploying deep neural networks on resource-limited devices because, unlike quantization-aware training, it requires neither a full training dataset nor end-to-end training. As PTQ schemes based on reconstructing each layer or block output have proven effective at improving quantized model performance, recent works have developed algorithms that devise and learn a new weight-rounding scheme so as to better reconstruct each layer or block output. In this work, we propose a simple yet effective new weight-rounding mechanism for PTQ, coined FlexRound, based on element-wise division instead of the typical element-wise addition, such that FlexRound enables jointly learning a common quantization grid size as well as a different scale for each pre-trained weight. Thanks to the reciprocal rule of derivatives induced by element-wise division, FlexRound is inherently able to exploit pre-trained weights when updating their corresponding scales, and thus to flexibly quantize pre-trained weights depending on their magnitudes. We empirically validate the efficacy of FlexRound on a wide range of models and tasks. To the best of our knowledge, our work is the first to carry out comprehensive experiments on not only image classification and natural language understanding but also natural language generation. Moreover, we demonstrate, for the first time, that large language models can be efficiently quantized by reconstructing the output in a block-by-block manner, with only a negligible impact on performance compared to half-precision baselines. Our code is available at https://github.com/onliwad101/FlexRound_LRQ.
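
The role of the reciprocal rule can be made concrete with a small numerical sketch. It assumes a straight-through estimator (the derivative of rounding is treated as 1), which is an assumption of this example, and it writes the division factor as $s_1 \cdot S'$ for a single weight. Under that assumption, $\partial \widehat{W} / \partial S' = s_1 \cdot (-W / (s_1 S'^2)) = -W / S'^2$, so the update signal for a scale is proportional to the weight it divides. The function name `ste_grad_wrt_scale` and the values are hypothetical.

```python
def ste_grad_wrt_scale(w, s_prime, upstream):
    """Gradient of the loss w.r.t. S' for W_hat = s1 * round(W / (s1 * S')).

    Assumption: straight-through estimator, i.e. d round(x)/dx := 1.
    Then dW_hat/dS' = s1 * (-w / (s1 * S'^2)) = -w / S'^2 (reciprocal rule),
    and the chain rule multiplies by the upstream gradient dL/dW_hat.
    """
    return upstream * (-w / (s_prime ** 2))


# Same scale and same upstream gradient, different weight magnitudes.
small = ste_grad_wrt_scale(w=0.1, s_prime=1.0, upstream=1.0)
large = ste_grad_wrt_scale(w=1.0, s_prime=1.0, upstream=1.0)

# The weight that is 10x larger receives a 10x larger gradient on its scale.
ratio = large / small
```

With an additive rounding offset, by contrast, the derivative of the quantized weight with respect to the offset would not involve $W$ at all; the dependence on weight magnitude comes from the division.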

Paper Structure (30 sections, 1 theorem, 3 equations, 7 figures, 24 tables)

Introduction
Related Work
Methodology
Preliminaries
Notations
PTQ Background
FlexRound
Experiments
Ablation Study
Ablation Study 1
Ablation Study 2
ResNet and MobileNetV2 on ImageNet
Language Models
BERT and GPT-Neo on GLUE
GPT-Neo and OPT on WikiText2 and PTB
...and 15 more sections

Key Result

Proposition 3.1

Let $\mathcal{L}$ be the reconstruction error computed from Eq. \ref{eq:flexround2} and ${\bm{S}}'$ be the matrix (or tensor) scaling pre-trained weights ${\bm{W}}$ in Eq. \ref{eq:flexround2}, i.e., ${\bm{S}}' = {\bm{S}}_2 \odot {\bm{s}}_3$ (or ${\bm{S}}_2 \odot {\bm{s}}_3 \odot {\bm{s}}_4$). Then, the gradient

Figures (7)

Figure 1: Illustration of FlexRound in the per-tensor uniform PTQ reconstruction. $s_1$ is a common quantization grid size across a layer, and $S_{(i, j)}$ is the division factor for a pre-trained weight $W_{(i, j)}$, both of which are positive and learnable. As shown in (b), with different learned $S_{(i, j)}$ via (a), FlexRound flexibly quantizes pre-trained weights by observing $W_{(2, 4)} < W_{(3, 2)}$ but $\widehat{W}_{(2, 4)} > \widehat{W}_{(3, 2)}$.
Figure 2: Formation of ${\bm{S}}$ in Eq. \ref{eq:flexround} for a linear layer ${\bm{W}}$. $s_1$ is a common quantization grid size across a layer, ${\bm{S}}_2$ is the matrix scaling ${\bm{W}}$, and ${\bm{s}}_3$ is an additional vector supporting ${\bm{S}}_2$ to account for the variation of output channel's statistics in ${\bm{W}}$. As a result, ${\bm{S}} = s_1 \odot {\bm{S}}_2 \odot {\bm{s}}_3$ is the division factor for a linear layer ${\bm{W}}$.
Figure 3: Weight updates through FlexRound of the first 2D convolution in the first block of (a) MobileNetV2 and (b) ResNet-18, after quantizing pre-trained weights to $4$-bit (via FlexRound) while activations are kept in full-precision.
Figure 4: Amount of grid shifts from the grids obtainable from RTN in the second 2D convolution of the sixth block of MobileNetV2 when only weights are quantized to $4$-bit via FlexRound. Unlike the right side of Figure \ref{fig:histogram}, weights of large magnitude are quantized with similar flexibility to those of moderate magnitude.
Figure 5: Number of grid shifts from the grids attainable from RTN in the query projection of the first self-attention layer of $\text{BERT}_{\text{BASE}}$ fine-tuned on the MRPC dataset when quantizing both weights and input activations of self-attention and feed-forward layers to $8$-bit via FlexRound. FlexRound can provide up to about $60$ grid shifts from the grids obtainable from RTN.
...and 2 more figures
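
The "grid shifts" reported in Figures 4 and 5 can be illustrated with a short sketch. It assumes that a grid shift is the difference between the integer index chosen with a learned division factor and the index chosen by rounding-to-nearest (RTN) on the same grid size $s_1$; the captions do not spell out the exact definition, so this reading, the name `grid_shift`, and the example values are assumptions.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with ties going up."""
    return math.floor(x + 0.5)


def grid_shift(w, s1, s_prime):
    """Integer index with division factor s1 * s_prime, minus the RTN index.

    Assumption: both indices use the same grid size s1, and RTN corresponds
    to s_prime == 1. A result of 0 means the weight lands on its RTN grid point.
    """
    flex_index = round_half_up(w / (s1 * s_prime))
    rtn_index = round_half_up(w / s1)
    return flex_index - rtn_index


# Illustrative values only (not taken from the paper).
shift_large = grid_shift(1.10, 0.5, 0.8)  # index 3 vs RTN index 2
shift_small = grid_shift(0.05, 0.5, 0.8)  # index 0 vs RTN index 0
```

In this toy case the same division factor moves the larger weight by one grid point and leaves the smaller one in place, since dividing by a factor below 1 changes a large ratio $W / s_1$ by more than a small one.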

Theorems & Definitions (1)

Proposition 3.1
