Magic for the Age of Quantized DNNs

Yoshihide Sawada, Ryuji Saiin, Kazuma Suetake

TL;DR

This work tackles the challenge of deploying large DNNs on devices with limited compute by developing a quantization-aware training framework named MaQD. MaQD combines Layer-Batch Normalization (LBN), Weight Standardization (WS), scaled round-clip quantization, and surrogate-gradient optimization to produce highly compressed networks with minimal accuracy loss. The authors demonstrate that LBN remains effective with small mini-batch sizes and that MaQD can quantize both weights and activations across a range of state counts, achieving near-original performance on CIFAR datasets and enabling SNN-like inference in low-bit regimes. The approach holds promise for hardware-friendly AI on edge devices and could extend to challenging tasks, such as ImageNet-scale recognition and large language models, with further hardware-aware optimizations.

Abstract

Recently, the number of parameters in DNNs has increased explosively, as exemplified by LLMs (Large Language Models), making inference on small-scale computers more difficult. Model compression technology is, therefore, essential for integration into products. In this paper, we propose a method of quantization-aware training. We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and does not require any additional computation cost during inference. Then, we quantize the weights using the scaled round-clip function with weight standardization. We also quantize activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and quantized activation functions. We call this method Magic for the age of Quantised DNNs (MaQD). Experimental results show that our method achieves quantization with minimal accuracy degradation.
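
The weight path described in the abstract (standardize, then apply a scaled round-clip function, then train through it with a surrogate gradient) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the parameter `m`, the specific scale-round-clip-rescale form, and the use of a straight-through-style surrogate as a stand-in for whatever surrogate the authors actually use.

```python
import math
from statistics import fmean, pstdev


def standardize(weights, eps=1e-5):
    """Weight standardization: shift to zero mean and scale to roughly unit variance."""
    mu = fmean(weights)
    sigma = pstdev(weights, mu)
    return [(w - mu) / (sigma + eps) for w in weights]


def scaled_round_clip(x, m):
    """ASSUMED form: scale by m, round to nearest integer, clip to [-m, m], rescale.

    This yields 2*m + 1 evenly spaced states in [-1, 1]. The paper's exact
    definition may differ.
    """
    k = math.floor(x * m + 0.5)   # round half up, avoiding Python's banker's rounding
    k = max(-m, min(m, k))        # clip to the representable integer range
    return k / m


def surrogate_grad(x):
    """Straight-through-style stand-in: pass gradients inside the clip range only.

    The true derivative of round-clip is zero almost everywhere, so training
    needs some surrogate; this particular choice is an assumption.
    """
    return 1.0 if -1.0 <= x <= 1.0 else 0.0


# Hypothetical usage: quantize a small weight vector to three states {-1, 0, 1}.
w = [0.9, -0.2, 0.1, -1.4]
w_q = [scaled_round_clip(v, 1) for v in standardize(w)]
```

With `m = 1` every quantized weight lies in {-1, 0, 1}; larger `m` gives a finer grid over the same range.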

Paper Structure (18 sections, 13 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Range per normalization layer. Green areas are used for normalization, where $C$ denotes the channel axis, $N$ denotes the batch axis, and $H$ and $W$ denote the height and width axes, respectively. From left to right: BN, LN, and LBN.
  • Figure 2: Weight
  • Figure 3: Activation function
  • Figure 5: Left: Losses for training with varying mini-batch sizes for each normalization method. The solid and dashed lines represent the loss for the training and test datasets, respectively. Right: GPU memory usage when the mini-batch size is varied for each normalization method.
  • Figure 6: Examples of non-zero ratios for each layer when varying $M_{\rm w}$ and fixing $M_{\rm a}$. The dataset is CIFAR-10, and the network architectures are VGG (top) and PreActResNet (bottom). Note that since the output layer is not quantized, the graph of $R_{\rm a}$ has one less layer index.
  • ...and 1 more figures