DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations
Wenhao Hu, Paul Henderson, José Cano
TL;DR
DQA addresses the problem of efficiently quantizing DNN activations to fewer than 6 bits on resource-constrained devices. It ranks activation channels by importance offline using a greedy search. Important channels are first quantized with $n+m$ bits and then right-shifted by $m$ bits to reach the target bitwidth $n$; the resulting shifting errors are stored and Huffman-coded to limit memory overhead. Unimportant channels are quantized directly to the target bitwidth, giving a mixed-precision scheme that preserves accuracy while reducing computation. On CIFAR-10 and CityScapes with ResNet-32, MobileNetV2, and U-Net, DQA achieves substantial accuracy gains over direct quantization and NoisyQuant, especially at lower bitwidths, and Huffman coding keeps the memory cost of the stored shifting errors practical. The method targets devices with tight energy, memory, and compute budgets; hardware co-design and latency measurements are left to future work.
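The shift-based path for important channels can be illustrated with a minimal sketch. It assumes non-negative activations, a uniform quantizer, and a known clipping value `max_value`; the function names and the choice of quantizer are hypothetical and are not taken from the paper. It only shows the arithmetic relationship between the $(n+m)$-bit value, the $n$-bit code obtained by right-shifting, and the $m$-bit shifting error.

```python
def quantize_unsigned(values, bits, max_value):
    """Uniformly quantize floats in [0, max_value] to `bits`-bit integers.

    Assumption: a plain round-to-nearest uniform quantizer stands in for
    whatever quantizer the method actually uses.
    """
    levels = (1 << bits) - 1
    scale = max_value / levels
    codes = [min(levels, max(0, round(v / scale))) for v in values]
    return codes, scale


def shift_quantize(values, n, m, max_value):
    """Quantize to n+m bits, then right-shift by m to reach n bits.

    Returns the n-bit codes, the m-bit shifting errors (the low bits
    dropped by the shift), and the scale of the (n+m)-bit grid.
    """
    wide, scale = quantize_unsigned(values, n + m, max_value)
    mask = (1 << m) - 1
    codes = [w >> m for w in wide]      # n-bit values used for compute
    errors = [w & mask for w in wide]   # m-bit remainders stored separately
    return codes, errors, scale


def restore(codes, errors, m):
    """Recombine n-bit codes and m-bit shifting errors into (n+m)-bit values."""
    return [(c << m) | e for c, e in zip(codes, errors)]


if __name__ == "__main__":
    acts = [0.0, 0.1, 0.37, 0.9, 1.0]
    codes, errors, scale = shift_quantize(acts, n=4, m=2, max_value=1.0)
    print(codes, errors, restore(codes, errors, m=2))
```

In this sketch, the codes and the shifting errors together carry exactly the information of the $(n+m)$-bit quantization, which is why storing the errors compactly matters for memory overhead.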
Abstract
Quantization of Deep Neural Network (DNN) activations is a commonly used technique for reducing compute and memory demands during DNN inference, which is particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations and uses simple shifting-based operations and Huffman coding to be efficient while achieving high accuracy. We evaluate DQA with 3-, 4-, and 5-bit quantization on three DNN models for two tasks, image classification and image segmentation, using two datasets. For sub-6-bit quantization, DQA achieves significantly better accuracy (up to 29.28%) than both the direct quantization method and the state-of-the-art NoisyQuant.
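To make the role of Huffman coding concrete, the following is a small sketch of a standard Huffman coder applied to a list of small integer symbols, such as $m$-bit shifting errors. It is a generic textbook construction, not the paper's implementation; the function names and the example symbol distribution are assumptions chosen for illustration. It shows that a skewed distribution of symbols can be stored in fewer bits than a fixed-width encoding.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code {symbol: bitstring} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: partial_code}); the
    # unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codewords of all symbols into one bitstring."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode a bitstring produced by `encode` using the same code table."""
    inverse = {b: s for s, b in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


if __name__ == "__main__":
    # Hypothetical 2-bit shifting errors with a skewed distribution.
    errors = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
    code = huffman_code(errors)
    bits = encode(errors, code)
    print(len(bits), "bits vs", 2 * len(errors), "bits at fixed width")
```

The saving depends entirely on how skewed the symbol distribution is; for uniformly distributed symbols, Huffman coding gives no benefit over a fixed-width encoding.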
