DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations

Wenhao Hu, Paul Henderson, José Cano

TL;DR

DQA tackles the challenge of efficiently quantizing DNN activations at sub-6-bit resolutions for resource-constrained devices. It ranks activation channels by importance offline via a greedy search, then quantizes important channels with $n+m$ bits and right-shifts them to the target bit width, storing and Huffman-coding the resulting shifting errors to limit memory overhead. Unimportant channels follow a direct quantization path, enabling a mixed-precision scheme that preserves accuracy while reducing computation. Across CIFAR-10 and CityScapes with ResNet-32, MobileNetV2, and U-Net, DQA achieves substantial accuracy gains over direct quantization and NoisyQuant, especially at lower bit widths, and uses Huffman coding to keep the memory cost of the stored shifting errors low. This approach promises practical deployment on devices with tight energy, memory, and compute budgets, with future work focusing on hardware co-design and latency measurements.
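The shift-based path for important channels can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an unsigned uniform quantizer whose scale is derived from a known maximum activation value, it takes $n$ as the target bit width and $m$ as the number of extra bits, and the function names `quantize_with_shift` and `dequantize_with_errors` are hypothetical.

```python
def quantize_with_shift(values, n, m, max_val):
    """Quantize to n+m bits, then right-shift by m bits.

    Assumption: unsigned uniform quantization over [0, max_val].
    Returns the n-bit codes, the m-bit shifting errors, and the scale.
    """
    levels = (1 << (n + m)) - 1          # largest (n+m)-bit code
    scale = max_val / levels
    codes, errors = [], []
    for v in values:
        q_hi = min(max(int(round(v / scale)), 0), levels)  # clamp to range
        codes.append(q_hi >> m)                  # keep the top n bits
        errors.append(q_hi & ((1 << m) - 1))     # bits dropped by the shift
    return codes, errors, scale


def dequantize_with_errors(codes, errors, m, scale):
    """Restore the (n+m)-bit value by adding the shifting error back."""
    return [((c << m) + e) * scale for c, e in zip(codes, errors)]


codes, errors, scale = quantize_with_shift([0.0, 13.2, 45.7, 63.0],
                                           n=3, m=3, max_val=63.0)
restored = dequantize_with_errors(codes, errors, m=3, scale=scale)
```

With these inputs the 3-bit codes are `[0, 1, 5, 7]` and the shifting errors are `[0, 5, 6, 7]`. Adding the errors back recovers the 6-bit values exactly, which is why the errors are worth storing for important channels.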

Abstract

Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and Huffman coding to be efficient and achieve high accuracy. We evaluate DQA with 3, 4, and 5-bit quantization levels and three different DNN models for two different tasks, image classification and image segmentation, on two different datasets. DQA shows significantly better accuracy (up to 29.28%) compared to the direct quantization method and the state-of-the-art NoisyQuant for sub-6-bit quantization.

Paper Structure

This paper contains 11 sections, 4 equations, 2 figures, 4 tables, and 3 algorithms.

Figures (2)

  • Figure 1: DQA overview. (1) Offline, rank the activation channels based on importance using training/calibration data and a greedy search algorithm (green circles represent the most important channels for which we skip quantization); (2) during inference, quantize important activation channels with $m$ extra bits and then right-shift them while saving the shifting errors; (3) the shifting errors are Huffman-encoded to reduce the memory requirement; (4) de-quantize activation channels. For important channels, decode the Huffman-encoded shifting errors and add them to the quantized activation channel values. For non-important channels, use the direct method to de-quantize.
  • Figure 2: Average frequency distribution of shifting errors for ResNet-32 and CIFAR-10 with 3-, 4-, and 5-bit quantization and $m=3$.
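
Steps 3 and 4 of Figure 1 rely on Huffman coding of the shifting errors. The following sketch shows the general technique using only the Python standard library. It is a textbook Huffman coder, not the paper's implementation; the helper names (`build_huffman_table`, `encode`, `decode`) and the example error values are made up for illustration.

```python
import heapq
from collections import Counter


def build_huffman_table(symbols):
    """Return a {symbol: bitstring} prefix code built from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:                      # a single symbol still needs one bit
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie_breaker, partial code table for that subtree).
    heap = [(count, i, {sym: ""})
            for i, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)     # two least frequent subtrees
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, table):
    return "".join(table[s] for s in symbols)


def decode(bits, table):
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:              # prefix code: first match is the symbol
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical, skewed shifting errors for m = 3 (values in 0..7).
shift_errors = [0] * 10 + [1] * 3 + [2] * 2 + [7]
table = build_huffman_table(shift_errors)
bits = encode(shift_errors, table)
```

When the error distribution is skewed, as in this made-up example, the encoded stream is shorter than the fixed $m$ bits per error that raw storage would need, and decoding recovers the errors exactly.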