Retraining-free Model Quantization via One-Shot Weight-Coupling Learning

Chen Tang, Yuan Meng, Jiacheng Jiang, Shuzhao Xie, Rongwei Lu, Xinzhu Ma, Zhi Wang, Wenwu Zhu

TL;DR

The paper addresses the high retraining cost of mixed-precision quantization (MPQ) by introducing a one-shot training-searching pipeline that learns a single weight-sharing MPQ model across multiple bit-width configurations. During training, a dynamic bit-width scheduler freezes interfering bit-widths, and an information distortion mitigation term aligns poorly performing bit-widths with well-performing ones, which stabilizes optimization and removes the need for retraining. An inference-only bidirectional greedy search then selects per-layer bit-widths under a BitOps constraint, yielding competitive accuracy at a substantially lower deployment cost. Experiments on ImageNet and transfer tasks, together with ablations, support the effectiveness of the approach and its potential for efficient MPQ on resource-limited devices.
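
The search step above can be illustrated with a minimal sketch. It is not the paper's algorithm: the paper's search is bidirectional, whereas this sketch only lowers bit-widths, and it assumes the common per-layer definition BitOps = MACs x weight bits x activation bits with weights and activations sharing one bit-width per layer. The names `bitops`, `greedy_search`, `toy_score`, and `SENSITIVITY` are hypothetical; in practice the score would come from inference on held-out data with the shared-weight model.

```python
from typing import Callable, List, Sequence


def bitops(macs: Sequence[int], bits: Sequence[int]) -> int:
    """Total BitOps, assuming MACs * b_w * b_a per layer with b_w == b_a."""
    return sum(m * b * b for m, b in zip(macs, bits))


def greedy_search(
    macs: Sequence[int],
    candidates: Sequence[int],
    budget: int,
    score: Callable[[List[int]], float],
) -> List[int]:
    """Lower one layer's bit-width per step until the BitOps budget is met.

    Each step tries every single-layer reduction and keeps the one with the
    best score. No training happens here; `score` is inference-only.
    """
    levels = sorted(candidates)
    config = [levels[-1]] * len(macs)  # start from the highest bit-width
    while bitops(macs, config) > budget:
        best_trial, best_score = None, float("-inf")
        for i, b in enumerate(config):
            idx = levels.index(b)
            if idx == 0:
                continue  # this layer is already at the lowest bit-width
            trial = list(config)
            trial[i] = levels[idx - 1]
            s = score(trial)
            if s > best_score:
                best_trial, best_score = trial, s
        if best_trial is None:
            raise ValueError("budget cannot be met with the given candidates")
        config = best_trial
    return config


# Hypothetical stand-in for validation accuracy: layers with a higher
# sensitivity are penalized more when their bit-width is reduced.
SENSITIVITY = [5.0, 1.0, 2.0]


def toy_score(bits: List[int]) -> float:
    return -sum(s / b for s, b in zip(SENSITIVITY, bits))


MACS = [100, 400, 100]
chosen = greedy_search(MACS, [2, 4, 6], budget=12000, score=toy_score)
print(chosen, bitops(MACS, chosen))
```

With these toy numbers the search keeps the most sensitive layer at 6 bits and lowers the other two, because each evaluation only compares configurations that differ in a single layer.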

Abstract

Quantization is important for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. Conversely, mixed-precision quantization (MPQ) compresses the model effectively by allocating heterogeneous bit-widths to layers. MPQ is typically organized into a two-stage searching-retraining process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under a high compression ratio. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-widths of layers during training, so that the remaining bit-widths converge properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique to align the behavior of the poorly performing bit-widths with that of the well-performing ones. In the second stage, an inference-only greedy search scheme evaluates the quality of configurations without introducing any additional training cost. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. Code is available at https://github.com/1hunters/retraining-free-quantization.
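
To make the weight-coupling idea concrete, the sketch below quantizes one shared set of latent weights at several bit-widths and measures how far each quantized copy lands from the latent values. It assumes a plain symmetric uniform quantizer with a max-based scale; the quantizer actually used in the paper is not described in this summary, and `quantize`, `mse`, and the weight values are hypothetical.

```python
from typing import Dict, List


def quantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization (an assumed quantizer, not the paper's).

    Values are scaled so the largest magnitude maps to the top integer level,
    rounded to the nearest integer, clipped, and scaled back.
    Assumes at least one non-zero weight so that the scale is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared distance between two equally sized weight lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# One set of latent full-precision weights shared by every bit-width.
latent = [0.9, -0.4, 0.2, -0.05]

errors: Dict[int, float] = {
    bits: mse(latent, quantize(latent, bits)) for bits in (2, 4, 6)
}
for bits, err in errors.items():
    print(f"{bits}-bit quantization error: {err:.6f}")
```

Because every bit-width reads the same latent weights, a gradient step that helps the 2-bit copy also moves the weights seen by the 4-bit and 6-bit copies, which is the coupling behind the interference described above.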

Paper Structure (15 sections, 9 equations, 4 figures, 7 tables)

This paper contains 15 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) 2D regression with a single 4-bit quantization, (b) 2D regression with 4-bit quantization plus an additional 2-bit quantization (i.e., weight-sharing quantization), and (c) the L2-normalized gradients of these two regressions. Compared with Figure 1(a), the weight in Figure 1(b) is more unstable due to bit-width interference. Notably, the gradient of the 4-bit quantization also has a larger variance under weight-sharing.
  • Figure 2: Distance between full-precision latent weights and quantized weights for a point-wise conv layer of MobileNetV2. Left: 4 bits. Right: 6 bits.
  • Figure 3: Output density at 2 bits and 6 bits. The small bit-width shows noteworthy information distortion.
  • Figure 4: Output density at 2 bits and 6 bits with our IDM training. Compared with Figure 3, the information distortion of the small bit-width is significantly mitigated.

Theorems & Definitions (1)

  • Definition 3.1: Bit-width Representation Set