Magic for the Age of Quantized DNNs
Yoshihide Sawada, Ryuji Saiin, Kazuma Suetake
TL;DR
This work tackles the challenge of deploying large DNNs on devices with limited compute by developing a quantization-aware training framework named MaQD. MaQD combines Layer-Batch Normalization (LBN), Weight Standardization (WS), scaled round-clip quantization, and surrogate-gradient optimization to produce highly compressed networks with minimal accuracy loss. The authors demonstrate that LBN remains effective with small mini-batch sizes and that MaQD can quantize both weights and activations across a range of state counts, achieving near-original performance on CIFAR datasets and enabling SNN-like inference in low-bit regimes. The approach holds promise for hardware-friendly AI on edge devices and could extend to challenging tasks, such as ImageNet-scale recognition and large language models, with further hardware-aware optimizations.
Abstract
Recently, the number of parameters in DNNs has increased explosively, as exemplified by large language models (LLMs), making inference on small-scale computers more difficult. Model compression technology is therefore essential for integration into products. In this paper, we propose a quantization-aware training method. We introduce a novel normalization, Layer-Batch Normalization, that is independent of the mini-batch size and incurs no additional computation cost during inference. We then quantize the weights using the scaled round-clip function combined with weight standardization. We also quantize the activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and quantized activations. We call this method Magic for the Age of Quantized DNNs (MaQD). Experimental results show that our method achieves quantization with minimal accuracy degradation.
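The weight path described above (standardize, then apply a scaled round-clip function) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the function, so everything here is an assumption for illustration: the names `standardize_weights` and `scaled_round_clip`, the per-row statistics, the `eps` constant, and the choice of 2m + 1 evenly spaced levels in [-1, 1] are hypothetical and may differ from the paper's definition. Only the forward pass is shown; since rounding has zero gradient almost everywhere, training would additionally need a surrogate gradient, which is not implemented here.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    """Weight standardization (assumed form): zero mean and roughly
    unit variance for each output unit, i.e. each row of w."""
    mean = w.mean(axis=1, keepdims=True)
    std = w.std(axis=1, keepdims=True)
    return (w - mean) / (std + eps)

def scaled_round_clip(x, m=1):
    """Hypothetical scaled round-clip: scale by m, round to the nearest
    integer, clip to [-m, m], and rescale. The output takes one of
    2*m + 1 values in [-1, 1]. Note that np.round rounds exact halves
    to the nearest even integer."""
    return np.clip(np.round(x * m), -m, m) / m

# Toy weight matrix with two output units and four inputs.
w = np.array([[0.1, -0.7, 2.0, 0.4],
              [1.0, 2.0, 3.0, 4.0]])

w_std = standardize_weights(w)
w_q = scaled_round_clip(w_std, m=1)  # ternary weights in {-1, 0, 1}
print(w_q)
```

With `m=1` each weight collapses to one of three states; larger `m` gives a finer grid, which is one way to read the "range of state counts" mentioned in the summary. The same function could be applied to activations under the same assumptions.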
