Towards Accurate Post-training Quantization for Reparameterized Models

Luoming Zhang, Yefei He, Wen Fei, Zhenyu Lou, Weijia Wu, YangWei Ying, Hong Zhou

TL;DR

This paper tackles post-training quantization (PTQ) for reparameterized networks, where activation outliers driven by BatchNorm interactions cause substantial accuracy loss. It introduces RepAPQ, a framework combining Quantization Protecting Reparameterization (QPRep), which adds a channel-wise affine layer after reparameterized convolutions to stabilize outliers, and Across-block Calibration (ABC), which uses Mean Absolute Error (MAE) to distill information across blocks and stages. The approach outperforms prior PTQ methods by roughly 1% at 8-bit and 2% at 6-bit on models such as RepVGG and MobileOne, and maintains competitive accuracy on object detection with YOLOv6. The results indicate that an MAE-based distortion measure and cross-block supervision enable accurate low-bit quantization of reparameterized networks, which matters for deploying fast, memory-efficient models on hardware. The code accompanying RepAPQ is publicly available.

Abstract

Model reparameterization is a widely accepted technique for improving inference speed without compromising performance. However, current Post-training Quantization (PTQ) methods often lead to significant accuracy degradation when applied to reparameterized models. This is primarily caused by channel-specific and sample-specific outliers, which appear only at specific samples and channels and affect the selection of quantization parameters. To address this issue, we propose RepAPQ, a novel framework that preserves the accuracy of quantized reparameterized models. Unlike previous frameworks that use Mean Squared Error (MSE) as the error measure, we use Mean Absolute Error (MAE) to mitigate the influence of outliers on quantization parameters. Our framework comprises two main components: Quantization Protecting Reparameterization and Across-block Calibration. For effective calibration, Quantization Protecting Reparameterization combines multiple branches into a single convolution with an affine layer. During training, the affine layer accelerates convergence and amplifies the output of the convolution to better accommodate samples with outliers. Additionally, Across-block Calibration uses the error measured at the stage output as supervision to address the gradient problem introduced by MAE and to strengthen the interlayer correlation with quantization parameters. Comprehensive experiments demonstrate the effectiveness of RepAPQ across various models and tasks. Our framework outperforms previous methods by approximately 1% for 8-bit PTQ and 2% for 6-bit PTQ. The code is available at https://github.com/ilur98/DLMC-QUANT.
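
The effect of the error measure on quantization parameters can be illustrated with a toy example. The following is a minimal sketch, not the RepAPQ calibration procedure: it assumes a single tensor of non-negative activations with a handful of large outliers, a uniform unsigned quantizer, and a plain grid search over the scale. The function names (fake_quantize, search_scale) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np

def fake_quantize(x, scale, n_bits=6):
    """Uniform unsigned quantization followed by dequantization."""
    qmax = 2 ** n_bits - 1
    q = np.clip(np.round(x / scale), 0, qmax)
    return q * scale

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def mae(a, b):
    return float(np.mean(np.abs(a - b)))

def search_scale(x, metric, n_bits=6, n_candidates=200):
    """Grid-search the scale that minimizes `metric` between x and its quantized copy."""
    qmax = 2 ** n_bits - 1
    max_val = float(x.max())
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.01, 1.0, n_candidates):
        scale = frac * max_val / qmax  # clipping threshold is frac * max_val
        err = metric(x, fake_quantize(x, scale, n_bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
x = np.abs(rng.normal(0.0, 1.0, 10_000))  # bulk of the activations
x[:10] = 200.0                            # a few large outliers (0.1% of values)

scale_mse = search_scale(x, mse)
scale_mae = search_scale(x, mae)
print(f"scale chosen by MSE: {scale_mse:.4f}")
print(f"scale chosen by MAE: {scale_mae:.4f}")
```

Because MSE squares the clipping error of the outliers, the MSE search stretches the quantization range toward them and leaves few levels for the bulk of the values. MAE weights the same error linearly, so the selected scale stays close to the bulk. The sketch only shows how the measure shifts the selected parameters; it does not say whether clipping a given outlier is harmless for accuracy.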

Paper Structure

This paper contains 23 sections, 2 theorems, 20 equations, 8 figures, and 7 tables.

Key Result

Proposition 1

For a given input that follows a Gaussian mixture distribution $X \sim \sum_{K=1}^{2} \alpha_{x_K}\mathcal{N}(\mu_{x_K}, \sigma_{x_K}^2)$ and a weight of a convolution layer that follows a Gaussian distribution $W \sim \mathcal{N}(\mu_{w}, \sigma_{w}^2)$, the output of the convolution can be obtaine

Figures (8)

  • Figure 1: Box plot of the activations to be quantized in RepVGG-A0. The blue bar is the output of the 3x3 branch, the green bar is the output of the 1x1 branch, the orange bar is the output of the shortcut branch, the gray bar is the fused output, and the violet box covers the 99% range.
  • Figure 2: (a), (b), (c) show the maximums and minimums of different samples in the same layer. (d), (e), (f) show the maximums and minimums of different channels across different samples in the same layers. Orange denotes the minimums and blue denotes the maximums.
  • Figure 3: To measure the impact of clipping outliers, we define outliers as values more than 50 times larger than the mean. We then sweep the clip ratio applied to these outliers and clip the activations of several reparameterized models. The clipping results of RoBERTa are from Outlier Suppression [wei2022outlier].
  • Figure 4: Box plot of the full-precision activation range and the initial quantized activation range. We increase the range of the boxes from 75% to 99%. For the activation range, we exclude zero values, because their count does not affect quantization. The subplots show the distribution of the quantized activations for the 10th and 20th layers, quantized with the initial quantization parameters.
  • Figure 5: The process of our method. We consider a neural network with stage $n$ consisting of $N$ blocks. For the ABC technique, we use the feature maps of the intermediate blocks and of the stage to distill the quantized block. We only use the output of the distilled block, and as we progress deeper into the blocks, we strengthen the loss function. For the QPRep technique, we first merge the three branches and introduce an affine layer. During inference, the affine parameters are combined with the quantization scales and convolution biases.
  • ...and 3 more figures
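
The two QPRep steps described in the Figure 5 caption (merging the three branches, then folding the affine layer into scales and biases) can be sketched numerically. This is a minimal sketch under stated assumptions, not the authors' implementation: BatchNorm is assumed to be already folded into each branch, the input and output channel counts are equal so the shortcut is an identity, quantization is symmetric with per-channel weight scales, and every name (conv2d, gamma, beta, s_w, s_x) is hypothetical.

```python
import numpy as np

def conv2d(x, w, pad):
    """Naive stride-1 convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    k = w.shape[2]
    h, wd = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) to every output position.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

rng = np.random.default_rng(0)
c, h, wd = 4, 6, 6
x = rng.normal(size=(c, h, wd))
w3 = rng.normal(size=(c, c, 3, 3))   # 3x3 branch
w1 = rng.normal(size=(c, c, 1, 1))   # 1x1 branch
bias = rng.normal(size=c)            # summed bias of all branches (assumed BN-folded)

# Step 1: merge 3x3, 1x1 and identity branches into one 3x3 kernel.
w_merged = w3.copy()
w_merged[:, :, 1, 1] += w1[:, :, 0, 0]               # 1x1 kernel sits at the center tap
w_merged[np.arange(c), np.arange(c), 1, 1] += 1.0    # identity is a centered delta
y_branches = conv2d(x, w3, 1) + conv2d(x, w1, 0) + x + bias[:, None, None]
y_merged = conv2d(x, w_merged, 1) + bias[:, None, None]

# Step 2: channel-wise affine layer after the merged convolution, folded at inference.
gamma = rng.uniform(0.5, 2.0, size=c)
beta = rng.normal(size=c)
s_w = np.abs(w_merged).max(axis=(1, 2, 3)) / 127.0   # per-channel weight scale
s_x = np.abs(x).max() / 127.0                        # per-tensor activation scale
w_int = np.round(w_merged / s_w[:, None, None, None])
x_int = np.round(x / s_x)
acc = conv2d(x_int, w_int, 1)                        # integer-valued accumulator

# Unfused: dequantize, add bias, then apply the affine layer.
y_unfused = gamma[:, None, None] * (acc * (s_w * s_x)[:, None, None]
                                    + bias[:, None, None]) + beta[:, None, None]
# Fused: the affine parameters are absorbed into the scale and the bias.
s_fused = gamma * s_w * s_x
b_fused = gamma * bias + beta
y_fused = acc * s_fused[:, None, None] + b_fused[:, None, None]

print(np.allclose(y_branches, y_merged), np.allclose(y_unfused, y_fused))
```

Under these assumptions the affine layer adds no inference-time operation, since its per-channel multiply and add collapse into the dequantization scale and the bias that the quantized convolution already applies.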

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2