BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

Yongqi Xu, Yujian Lee, Gao Yi, Bosheng Liu, Yucong Chen, Peng Liu, Jigang Wu, Xiaoming Chen, Yinhe Han

TL;DR

BitQ is a BFP-based, bitwidth-aware analytical modeling framework for finding the best BFP implementation of DNN inference on embedded platforms. It formulates an optimization problem that identifies the optimal BFP block size and bitwidth distribution by trading off accuracy and performance loss.

Abstract

Deep neural networks (DNNs) are powerful for cognitive tasks such as image classification, object detection, and scene segmentation. One drawback, however, is their high computational complexity and memory consumption, which makes them infeasible to run in real time on embedded platforms with limited hardware resources. Block floating point (BFP) quantization is a representative compression approach for reducing the memory and computational burden, owing to its ability to capture the broad data distribution of DNN models. Unfortunately, prior work on BFP-based quantization chooses the block size and precision empirically to preserve accuracy. In this paper, we develop a BFP-based bitwidth-aware analytical modeling framework (called "BitQ") for the best BFP implementation of DNN inference on embedded platforms. We formulate and solve an optimization problem to identify the optimal BFP block size and bitwidth distribution by trading off accuracy and performance loss. Experimental results show that, compared with an equal bitwidth setting, BFP DNNs with optimized bitwidth allocation provide efficient computation while preserving accuracy on well-known benchmarks. The source code and data are available at https://github.com/Cheliosoops/BitQ.
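
To make the two quantities the abstract optimizes (block size and bitwidth) concrete, the following is a minimal sketch of generic BFP quantization: values are grouped into blocks, each block shares one exponent taken from its largest magnitude, and every value keeps only a short signed mantissa. The function name `bfp_quantize` and its parameters `block_size` and `mantissa_bits` are hypothetical names chosen for illustration. This is not the BitQ implementation, and it does not perform the block size and bitwidth search described in the paper.

```python
import math


def bfp_quantize(values, block_size=4, mantissa_bits=8):
    """Return a block floating point approximation of a flat list of floats.

    Illustrative only: each block of `block_size` values shares one exponent,
    and each value keeps a signed integer mantissa of `mantissa_bits` bits.
    """
    out = []
    qmax = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa value
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        if peak == 0.0:
            # An all-zero block needs no exponent.
            out.extend(0.0 for _ in block)
            continue
        # frexp gives peak = m * 2**shared_exp with 0.5 <= m < 1.
        _, shared_exp = math.frexp(peak)
        # Value of one mantissa step under the shared exponent.
        scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
        for v in block:
            # Round to the nearest step and clamp to the signed range.
            q = max(-qmax, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out


if __name__ == "__main__":
    data = [0.3, -0.7, 0.01, 0.9, 12.0, -3.5, 0.002, 6.25]
    for bits in (8, 4):
        approx = bfp_quantize(data, block_size=4, mantissa_bits=bits)
        worst = max(abs(a - b) for a, b in zip(data, approx))
        print(bits, "mantissa bits, worst absolute error:", worst)
```

In this sketch, small values that share a block with a large value lose the most precision, since the step size is set by the block's peak. Smaller blocks or wider mantissas reduce that error at the cost of more exponent or mantissa storage, which is the kind of trade-off the abstract refers to.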

Paper Structure (14 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Comparison of BFP quantization configuration selection strategies. The upper section illustrates the empirically driven strategy and the lower section the search framework strategy, with line charts showing the performance gap between the two selection methods on ResNet-18.
  • Figure 2: Workflow of the trade-off optimization of BitQ. (a) BFP quantization configuration. (b) Data movement ($DM$) expression. (c) Basis of convolution (c1) and data reuse in tiling (c2).
  • Figure 3: BFP data representation. (a) Illustration of 8-bit and 16-bit BFP data representation. (b) Process of generating BFP data from original data.
  • Figure 4: Visualization results of Original and $\text{BitQ}_{16(8)}$ on downstream tasks.
  • Figure 5: Normalized energy comparison for (a) image classification, (b) object detection, (c) instance segmentation, and (d) semantic segmentation. Lower energy consumption is better. Gmean denotes the geometric mean across the models for the corresponding visual task.
  • ...and 1 more figures