Nearly Lossless Adaptive Bit Switching

Haiduo Huang, Zhenhua Liu, Tian Xia, Wenzhe Zhao, Pengju Ren

TL;DR

Double Rounding is a quantization method that fully uses the quantized representation range to achieve nearly lossless bit-switching, and it reduces storage by keeping the model at its highest integer precision instead of full precision.

Abstract

Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with a single, uniform bit-width. Because bit-width requirements vary across hardware and transmission demands, training and storing a separate model for each setting incurs considerable cost. One-shot joint training of multiple precisions has been proposed to address this issue. Previous works either store a larger FP32 model and switch between precisions for higher accuracy, or store a smaller INT8 model but compromise accuracy because the quantization parameters are shared. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of the quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (ALRS) technique that dynamically adapts the learning rates for the various precisions to optimize the training process. Additionally, we extend Double Rounding to one-shot mixed-precision training and develop a Hessian-Aware Stochastic Bit-switching (HASB) strategy. Experimental results on ImageNet-1K classification demonstrate that our methods have clear advantages over state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision settings. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLM tasks. Our code is available at https://github.com/haiduo/Double-Rounding.
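
The bit-switching idea in the abstract can be illustrated with a minimal sketch. It assumes symmetric signed quantization, a single per-tensor scale, round-half-away-from-zero, and a lower-precision scale obtained by multiplying the stored scale by 2^(h-b); the paper's exact formulation (learned scales, rounding mode, activation handling) may differ. All function names here are hypothetical and are not taken from the authors' repository.

```python
import math


def round_half_away(v: float) -> int:
    """Round to nearest integer, ties away from zero (Python's round() uses ties-to-even)."""
    sign = 1 if v >= 0 else -1
    return sign * int(math.floor(abs(v) + 0.5))


def clamp(v: int, lo: int, hi: int) -> int:
    return max(lo, min(hi, v))


def quantize_highest(weights, scale, high_bits=8):
    """First rounding: map float weights to signed high_bits integers.

    Under the assumptions above, only these integers and the scale are stored.
    """
    lo, hi = -(2 ** (high_bits - 1)), 2 ** (high_bits - 1) - 1
    return [clamp(round_half_away(w / scale), lo, hi) for w in weights]


def switch_bits(q_high, scale, low_bits, high_bits=8):
    """Second rounding: derive low_bits integers from the stored integers.

    No full-precision weights are needed; the scale grows by 2**(high_bits - low_bits).
    """
    step = 2 ** (high_bits - low_bits)
    lo, hi = -(2 ** (low_bits - 1)), 2 ** (low_bits - 1) - 1
    q_low = [clamp(round_half_away(q / step), lo, hi) for q in q_high]
    return q_low, scale * step


def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [v * scale for v in q]


# Illustrative values only (assumed, not from the paper).
weights = [0.50, -0.26, 0.03, -1.00]
scale8 = 1 / 128
q8 = quantize_highest(weights, scale8)   # stored INT8 representation
q4, scale4 = switch_bits(q8, scale8, 4)  # switched to 4-bit on demand
w4 = dequantize(q4, scale4)
```

In this sketch the 4-bit model is produced entirely from the stored 8-bit integers, which is the storage saving the abstract describes relative to keeping FP32 weights.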

Paper Structure

This paper contains 25 sections, 6 equations, 16 figures, 9 tables, 5 algorithms.

Figures (16)

  • Figure 1: Overview of our proposed lossless adaptive bit-switching strategy.
  • Figure 2: Comparison of four quantization schemes (from left to right): LSQ [Esser2019], AdaBits [Jin2020], Bit-Mixer [Bulat2021], and our Double Rounding. In all cases $y=dequant(quant(x))$.
  • Figure 3: Statistics of ResNet18 on the ImageNet-1K dataset. (a) and (b): The quantization scale gradients' statistics for the weights, with outliers removed for clarity. (c) and (d): The multi-precision training processes of our Double Rounding without and with the ALRS strategy.
  • Figure 4: The HASB stochastic process and Mixed-precision of ResNet18 for {2,4,6,8}-bit.
  • Figure 5: Comparison of HASB and Baseline approaches for Mixed-Precision on ResNet18.
  • ...and 11 more figures