DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization

Behnam Ghavami, Amin Kamjoo, Lesley Shannon, Steve Wilton

TL;DR

The paper addresses memory constraints when deploying DNNs on edge devices and proposes PTILMPQ, a post-training intra-layer multi-precision quantization method that assigns higher bit-widths to important layers and channels and quantizes weights over non-overlapping regions to preserve accuracy. The approach has two phases: it first identifies important layers and channels, then quantizes the weights within two regions, one dense and one sparse, that share a boundary. It reports substantial memory reductions (e.g., ~37% for ResNet50 and ~81% for MobileNetV2) with minimal accuracy loss under mixed-precision settings, introduces an alpha-based mechanism to balance accuracy against model size, and includes a BOPs-based complexity analysis for practical deployment. Because it is a post-training method, it reduces memory without extensive retraining or training data, which suits privacy-motivated edge deployment. Results are validated on ImageNet with standard models.

Abstract

The imperative to deploy Deep Neural Network (DNN) models on resource-constrained edge devices, spurred by privacy concerns, has become increasingly apparent. To facilitate the transition from cloud to edge computing, this paper introduces a technique that effectively reduces the memory footprint of DNNs, accommodating the limitations of resource-constrained edge devices while preserving model accuracy. Our proposed technique, named Post-Training Intra-Layer Multi-Precision Quantization (PTILMPQ), employs a post-training quantization approach, eliminating the need for extensive training data. By estimating the importance of layers and channels within the network, the proposed method enables precise bit allocation throughout the quantization process. Experimental results demonstrate that PTILMPQ offers a promising solution for deploying DNNs on edge devices with restricted memory resources. For instance, in the case of ResNet50, it achieves an accuracy of 74.57% with a memory footprint of 9.5 MB, representing a 25.49% reduction compared to previous similar methods, with only a minor 1.08% decrease in accuracy.
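
The two-phase idea described above (pick important channels, then quantize weights over a dense and a sparse region that share a boundary) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The importance score (per-channel L1 norm), the boundary choice (a percentile of the weight magnitudes), the bit-widths, and all function names are assumptions made here for illustration only.

```python
import numpy as np

def quantize_uniform(x, lo, hi, bits):
    """Uniformly quantize x onto 2**bits levels spanning [lo, hi]."""
    if hi <= lo:
        return np.full_like(x, lo)
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((np.clip(x, lo, hi) - lo) / step) * step

def two_region_quantize(w, boundary, bits):
    """Quantize magnitudes separately in a dense region [0, boundary]
    and a sparse region [boundary, max|w|], then restore the signs."""
    mag = np.abs(w)
    top = max(float(mag.max()), boundary)
    dense = mag <= boundary
    out = np.empty_like(w)
    out[dense] = quantize_uniform(mag[dense], 0.0, boundary, bits)
    out[~dense] = quantize_uniform(mag[~dense], boundary, top, bits)
    return np.sign(w) * out

def select_important_channels(w, keep_ratio=0.25):
    """Hypothetical importance score: per-output-channel L1 norm."""
    scores = np.abs(w).reshape(w.shape[0], -1).sum(axis=1)
    k = max(1, int(round(keep_ratio * w.shape[0])))
    mask = np.zeros(w.shape[0], dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def quantize_layer(w, important, hi_bits=4, lo_bits=2, pct=90.0):
    """Important channels get hi_bits, the rest lo_bits.
    The region boundary (an assumed percentile) is shared per layer."""
    boundary = float(np.percentile(np.abs(w), pct))
    out = np.empty_like(w)
    for c in range(w.shape[0]):
        bits = hi_bits if important[c] else lo_bits
        out[c] = two_region_quantize(w[c], boundary, bits)
    return out

# Toy layer with 8 output channels and bell-shaped weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16))
mask = select_important_channels(weights)
quantized = quantize_layer(weights, mask)
```

Splitting the range at a boundary lets the densely populated small-magnitude weights and the sparse large-magnitude tail each get their own quantization grid, instead of a single grid stretched over the whole range.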

Paper Structure (9 sections, 13 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: A general view of the proposed method. Based on the function $F_l$ (Equation \ref{eq:layers}), layers are divided into important and non-important categories. The most important channels are then selected for each layer. Quantization is performed according to the important channels on non-overlapping regions (Sparse-Region and Dense-Region).
  • Figure 2: Weight distributions for MobileNetV2 and ResNet50. Both are approximately bell-shaped.
  • Figure 3: Dense-Region and Sparse-Region
  • Figure 4: Trade-off between accuracy and model size. The dotted line represents single-precision quantization, and the blue line represents mixed-precision quantization with different values of alpha.
  • Figure 5: The graph showcases the computed BOPs for both Single-Precision (SP) and our proposed Mixed-Precision (MP) method, providing insight into the trade-off between accuracy and computational complexity for MobileNetV2 and ResNet50 architectures.
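
The two quantities plotted in Figures 4 and 5, model size and BOPs, can be estimated with simple arithmetic. The sketch below is a rough illustration, not the paper's exact formulation: it assumes model size is the sum of parameter count times bit-width, and it uses the common simplification that the BOPs of a convolution are its multiply-accumulate count times the weight and activation bit-widths. The function names and layer shapes are hypothetical.

```python
def model_size_mb(param_counts, bit_widths):
    """Weight storage in MB, given per-group parameter counts and bit-widths."""
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / 8 / 1e6

def conv_bops(c_in, c_out, k, out_h, out_w, w_bits, a_bits):
    """Simplified bit-operation count: MACs * weight bits * activation bits."""
    macs = c_in * c_out * k * k * out_h * out_w
    return macs * w_bits * a_bits

# Hypothetical layer: 25% of the weights kept at 4 bits, the rest at 2 bits,
# compared against a single-precision 4-bit baseline.
n_params = 64 * 64 * 3 * 3
mixed = model_size_mb([n_params // 4, n_params - n_params // 4], [4, 2])
single = model_size_mb([n_params], [4])
```

In this toy setting the mixed-precision layer is smaller than the single-precision one because most weights use the lower bit-width, which is the kind of trade-off the accuracy versus size curves describe.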