In-Distribution Consistency Regularization Improves the Generalization of Quantization-Aware Training

Junbiao Pang, Tianyang Cai, Baochang Zhang, Jiaqi Wu

TL;DR

This work tackles the generalization gap in Quantization-Aware Training (QAT) by introducing Consistency Regularization (CR), which enforces stable predictions for two augmented views of the same input through a teacher-student framework with Exponential Moving Average (EMA) updates. By incorporating in-distribution unlabeled data and a KL-based consistency loss between the student and teacher, CR promotes a flatter loss landscape and reduced sensitivity to input and weight perturbations, theoretically linking consistency to reduced sharpness. Empirically, CR delivers state-of-the-art improvements across CIFAR-10/100 and ImageNet, often surpassing or closely matching FP32 performance on several architectures, and shows strong gains with unlabeled data and carefully scheduled CR strength. The method is simple to adopt, adaptable to various QAT pipelines, and holds practical impact for deploying efficient low-bit models on edge devices while leveraging unlabeled data for better generalization.
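The mechanism summarized above (a KL consistency term between teacher and student predictions on two augmented views, plus an EMA teacher) can be sketched in a few lines of framework-free Python. This is a minimal illustration, not the authors' implementation. The function names, the KL direction (teacher as target), the decay value, and the additive weighting by a strength `lam` are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(teacher_logits, student_logits):
    """CR term: KL between the teacher's prediction on one augmented view
    and the student's prediction on another view of the same input.
    Assumption: the teacher output is the target distribution."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight toward the student weight by an
    exponential moving average (the decay value is hypothetical)."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]

def total_loss(ce_loss, cr_loss, lam):
    """Supervised cross-entropy plus the CR term scaled by strength lam
    (assumed additive combination)."""
    return ce_loss + lam * cr_loss

# Usage: logits for two augmented views of one sample (made-up numbers).
teacher_view = [2.0, 0.5, -1.0]
student_view = [1.5, 0.7, -0.8]
cr = consistency_loss(teacher_view, student_view)
loss = total_loss(ce_loss=0.4, cr_loss=cr, lam=1.0)
new_teacher = ema_update([0.10, -0.20], [0.30, 0.00])
```

Because the CR term needs no label, the same `consistency_loss` call could be applied to in-distribution unlabeled samples, with only the cross-entropy term restricted to labeled ones.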

Abstract

Although existing Quantization-Aware Training (QAT) methods rely heavily on knowledge distillation to preserve performance, QAT still suffers from a severe performance drop. Experiments show that vanilla quantization is sensitive to perturbations of both the input and the weights. We therefore hypothesize that the performance drop of QAT is caused predominantly by intrinsic instability at training time and limited generalization ability at testing time. In this paper, we address both issues from a new perspective by leveraging Consistency Regularization (CR) to improve the generalization ability of QAT. Empirical results and theoretical analysis verify that CR brings good generalization ability to different network architectures and various QAT methods. Extensive experiments demonstrate that our approach significantly outperforms current state-of-the-art QAT methods and even the FP counterparts. On CIFAR-10, the proposed method improves on the baseline by 3.79% with ResNet18 and by 3.84% with the lightweight MobileNet.

Paper Structure

This paper contains 14 sections, 4 theorems, 29 equations, 8 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

If the KL divergence is used in the CR loss (eqt:crloss) and cross-entropy (CE) is used in the composite loss (eqt:compositedloss), optimizing the CR loss (eqt:crloss) is equivalent to optimizing the flatness of a network, i.e.,

Figures (8)

  • Figure 1: Comparison between KD and CR, where $x$ represents an original sample and $x^{Aug}$ represents the augmented sample. (a) KD is applied to individual augmented samples. (b) CR models the consistency between two augmented samples.
  • Figure 2: Comparison between vanilla QAT (a), quantization with KD [polino-qkd-arxiv-2018] (b), and our method (c).
  • Figure 3: The augmented sample.
  • Figure 4: Different settings of the strength $\lambda$ of CR.
  • Figure 5: Frequency plot of the top 50 eigenvalues for FP32, LSQ, and CR. The plot illustrates the maximum, mean, and variance of the top 50 eigenvalues.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Proposition 1: CR is equivalent to optimizing flatness in Eq. (eqt:flatness_loss_decomposition)
  • proof
  • Proposition 2: generalization
  • proof
  • Proposition 3: The perturbation in the input space is interchangeable with the one in the weight space
  • proof
  • Proposition 4: Expansion of perturbation from the input to all layers of a network
  • proof