Minimize Quantization Output Error with Bias Compensation

Cheng Gong, Haoshuai Zheng, Mengting Hu, Zheng Lin, Deng-Ping Fan, Yuzhi Zhang, Tao Li

TL;DR

Bias Compensation (BC) notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models.

Abstract

Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs), but it often leads to significant output error that hinders model deployment. In this paper, we propose Bias Compensation (BC) to minimize the output error, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as in most previous methods, BC bypasses that step and directly minimizes the quantization output error by identifying a bias vector for compensation. We show that minimizing the output error through BC is a convex problem, and we provide an efficient strategy to obtain the optimal solution with minimal output error, without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and Large Language Models, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models. In particular, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2. The code is available at https://github.com/GongCheng1919/bias-compensation.
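
To make the idea in the abstract concrete, the following is a minimal sketch of bias compensation for a single linear layer. It rests on assumptions not stated in this summary: the output error is taken to be the squared error over a calibration set, the quantizer is a simple symmetric uniform one standing in for PTQ4ViT or GPTQ, and the names `quantize_uniform` and `bias_compensation` are hypothetical. Under the squared-error assumption the problem is convex in the bias vector, and its minimizer is the per-channel mean of the gap between float and quantized outputs.

```python
import numpy as np


def quantize_uniform(w, num_bits=4):
    """Symmetric per-tensor uniform quantizer (illustrative stand-in only)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def bias_compensation(float_out, quant_out):
    """Bias vector minimizing the squared output error over calibration samples.

    For sum_i ||y_i - (yq_i + b)||^2, setting the gradient with respect to b
    to zero gives b = mean_i (y_i - yq_i), computed per output channel.
    """
    return (float_out - quant_out).mean(axis=0)


rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, size=(256, 64))  # assumed calibration activations
w = rng.normal(size=(64, 32))            # float weights of one linear layer
w_q = quantize_uniform(w, num_bits=3)    # 3-bit quantized weights

y = x @ w        # float output
y_q = x @ w_q    # quantized output
b = bias_compensation(y, y_q)

err_before = np.mean((y - y_q) ** 2)
err_after = np.mean((y - (y_q + b)) ** 2)
print(f"output MSE before: {err_before:.6f}, after: {err_after:.6f}")
```

Because `b = 0` is always a feasible choice, the compensated error on the calibration set can never exceed the uncompensated one under this objective. The bias is a single vector added to the layer output, so it can be folded into an existing bias term at inference time.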

Paper Structure (25 sections, 12 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Comparison between previous PTQ methods and our proposed bias compensation. (A) indicates local quantizer optimization methods. (B) and (C) are layer-wise quantizer and parameter optimization methods, respectively. Previous methods optimize the quantizer parameters or layer-wise weights to minimize the quantization loss or output error, which is non-convex and difficult to solve. Our method, shown in (D), directly minimizes the output error by solving for the best bias vector, which is a convex problem and guarantees minimal output error.
  • Figure 2: Illustration of bias compensation. We use absolute error as the output error for ease of understanding. Applying bias compensation after quantization can significantly reduce the output error without adding computational complexity.
  • Figure 3: Bias compensation positions in Transformer.
  • Figure 4: The attention output distribution of the first layer of ViT-S Transformer using 4-bit quantization. The output of PTQ4ViT significantly deviates from the float output. BC compensates for the output of PTQ4ViT and aligns it with the float one.
  • Figure 5: Layer-wise output errors of OPT-125M with different quantization methods on the calibration dataset. BC greatly decreases the output errors across all layers.
  • ...and 1 more figure