DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization
Behnam Ghavami, Amin Kamjoo, Lesley Shannon, Steve Wilton
TL;DR
The paper addresses memory constraints in deploying DNNs on edge devices and proposes PTILMPQ, a post-training intra-layer multi-precision quantization method that assigns higher bit-widths to important layers and channels and quantizes weights in non-overlapping regions to preserve accuracy. The approach has two phases: it first identifies important layers and channels, then quantizes weights within two regions, one dense and one sparse, that share a boundary. It reports substantial memory reductions (e.g., ~37% for ResNet50 and ~81% for MobileNetV2) with minimal accuracy loss under mixed-precision settings, introduces an alpha-based mechanism to balance accuracy against model size, and provides a BOPs-based (bit operations) complexity analysis for practical deployment. Because it is applied post-training, the method reduces memory without extensive retraining or training data, which supports privacy-motivated edge deployment. It is validated on ImageNet with standard models.
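To make the two phases concrete, the following is a minimal NumPy sketch, not the authors' implementation. Everything specific in it is an assumption made for illustration: the importance score (mean absolute weight per output channel), the `important_fraction` parameter (which is not the paper's alpha), the quantile used to place the dense/sparse boundary, the bit budget split (one sign bit, one region bit, the rest for the level), and the function names `quantize_two_regions` and `quantize_layer`.

```python
import numpy as np


def quantize_two_regions(w, bits, boundary):
    """Uniformly quantize magnitudes in two non-overlapping regions.

    Dense region:  [0, boundary]
    Sparse region: (boundary, max|w|]
    Assumed bit budget: 1 sign bit + 1 region bit + (bits - 2) level bits.
    """
    if bits < 3:
        raise ValueError("this sketch needs bits >= 3 (sign + region + level)")
    if boundary <= 0:
        raise ValueError("boundary must be positive")
    w = np.asarray(w, dtype=np.float64)
    mag = np.abs(w)
    steps = 2 ** (bits - 2) - 1          # uniform steps inside each region
    max_mag = float(mag.max())

    # Dense region: uniform grid from 0 up to the boundary.
    dense_step = boundary / steps
    q = np.round(np.minimum(mag, boundary) / dense_step) * dense_step

    # Sparse region: uniform grid from the shared boundary up to max|w|.
    if max_mag > boundary:
        sparse_step = (max_mag - boundary) / steps
        sparse = mag > boundary
        q[sparse] = boundary + np.round(
            (mag[sparse] - boundary) / sparse_step
        ) * sparse_step

    return np.sign(w) * q


def quantize_layer(weights, high_bits=6, low_bits=4,
                   important_fraction=0.25, boundary_quantile=0.9):
    """Phase 1: rank output channels; phase 2: quantize each channel."""
    weights = np.asarray(weights, dtype=np.float64)
    flat = weights.reshape(weights.shape[0], -1)

    # Hypothetical importance score: mean absolute weight per channel.
    importance = np.abs(flat).mean(axis=1)
    n_important = int(round(important_fraction * flat.shape[0]))
    important = np.zeros(flat.shape[0], dtype=bool)
    important[np.argsort(importance)[::-1][:n_important]] = True

    out = np.empty_like(flat)
    for c in range(flat.shape[0]):
        # Hypothetical boundary: a quantile of the channel's magnitudes.
        boundary = float(np.quantile(np.abs(flat[c]), boundary_quantile))
        bits = high_bits if important[c] else low_bits
        out[c] = quantize_two_regions(flat[c], bits, boundary)
    return out.reshape(weights.shape), important
```

In this sketch the top grid point of the dense region and the bottom grid point of the sparse region both sit at the boundary, which is one way to read the "shared boundary" described above. Raising `important_fraction` or `high_bits` trades model size for accuracy.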
Abstract
The imperative to deploy Deep Neural Network (DNN) models on resource-constrained edge devices, spurred by privacy concerns, has become increasingly apparent. To facilitate the transition from cloud to edge computing, this paper introduces a technique that effectively reduces the memory footprint of DNNs, accommodating the limitations of resource-constrained edge devices while preserving model accuracy. Our proposed technique, named Post-Training Intra-Layer Multi-Precision Quantization (PTILMPQ), employs a post-training quantization approach, eliminating the need for extensive training data. By estimating the importance of layers and channels within the network, the proposed method enables precise bit allocation throughout the quantization process. Experimental results demonstrate that PTILMPQ offers a promising solution for deploying DNNs on edge devices with restricted memory resources. For instance, in the case of ResNet50, it achieves an accuracy of 74.57% with a memory footprint of 9.5 MB, representing a 25.49% reduction compared to previous similar methods, with only a minor 1.08% decrease in accuracy.
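The memory figures in the abstract can be related with simple arithmetic. The sketch below assumes that 1 MB means 10^6 bytes and that only weight storage is counted. The helper names `footprint_mb` and `implied_baseline_mb` are hypothetical, and the group sizes in the mixed-precision example are made up for illustration rather than taken from the paper.

```python
def footprint_mb(groups):
    """Weight storage in MB for (num_weights, bits_per_weight) groups."""
    total_bits = sum(n * b for n, b in groups)
    return total_bits / 8 / 1e6


def implied_baseline_mb(size_mb, reduction_pct):
    """Size of the reference model implied by a relative reduction."""
    return size_mb / (1.0 - reduction_pct / 100.0)


# 9.5 MB at a 25.49% reduction implies a reference of about 12.75 MB.
baseline = implied_baseline_mb(9.5, 25.49)

# Illustrative mixed-precision layer: 1M weights at 4 bits, 1M at 2 bits.
mixed = footprint_mb([(1_000_000, 4), (1_000_000, 2)])  # 0.75 MB
```

Under these assumptions, the average bit-width of a mixed-precision model is its footprint in bits divided by its weight count, so moving weights from the higher to the lower bit-width group shrinks the footprint linearly.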
