Value-Driven Mixed-Precision Quantization for Patch-Based Inference on Microcontrollers

Wei Tao, Shenglin He, Kai Lu, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang, Jing Xiao

TL;DR

This work tackles the challenge of deploying neural networks on resource-constrained MCUs by addressing the redundant computation in patch-based inference. It introduces QuantMCU, which pairs value-driven patch classification (VDPC) with value-driven quantization search (VDQS) to reduce computation and memory without sacrificing accuracy; VDPC assigns patches to outlier or non-outlier classes, applying 8-bit quantization to outliers and their downstream branches, while VDQS uses an entropy-based, training-free metric to guide a lightweight per-feature-map bitwidth selection under memory constraints. The quantization score integrates computation and accuracy via $S(i,b) = -\lambda \Omega(i,b) + (1-\lambda) \Phi(i,b)$, with $\Phi(i,b)$ and $\Omega(i,b)$ defined through BitOPs reductions and entropy changes, respectively. Experimental results on real MCU devices with ImageNet and Pascal VOC demonstrate $\approx 2.2\times$ BitOPs reduction and $\approx 1.5\times$ latency reduction on average over state-of-the-art patch-based methods, while maintaining comparable accuracy. The approach enables practical, low-latency, memory-efficient patch-based inference on MCUs, highlighting the value of combining patch-level value awareness with entropy-guided quantization.
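The score $S(i,b)$ above can be illustrated with a minimal sketch. The summary gives only the weighted form of the score, so the concrete definitions here are assumptions made for illustration: $\Omega$ is taken as the relative entropy change of a feature map against an 8-bit reference, $\Phi$ as the relative BitOPs reduction $1 - b/8$ (assuming BitOPs scale linearly with activation bitwidth), and the helper names `entropy`, `quantize`, `score`, and `best_bitwidth` are hypothetical. The memory constraint is omitted.

```python
import math
from collections import Counter


def entropy(codes):
    """Shannon entropy (in bits) of a sequence of discrete codes."""
    counts = Counter(codes)
    n = len(codes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def quantize(values, bits, lo, hi):
    """Uniformly quantize values in [lo, hi] to integer codes with 2**bits levels."""
    if hi == lo:  # constant feature map: every value maps to code 0
        return [0 for _ in values]
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [round((min(max(v, lo), hi) - lo) / scale) for v in values]


def score(omega, phi, lam):
    """S = -lambda * Omega + (1 - lambda) * Phi, as stated in the summary."""
    return -lam * omega + (1 - lam) * phi


def best_bitwidth(values, candidates=(2, 4, 8), lam=0.5):
    """Return (bitwidth, score) with the highest score for one feature map.

    Assumed definitions (not taken from the paper):
      Omega = |H_8bit - H_b| / H_8bit   (relative entropy change)
      Phi   = 1 - b / 8                 (relative BitOPs reduction)
    """
    lo, hi = min(values), max(values)
    ref = entropy(quantize(values, 8, lo, hi))
    best = None
    for b in candidates:
        h_b = entropy(quantize(values, b, lo, hi))
        omega = abs(ref - h_b) / max(ref, 1e-12)
        phi = 1.0 - b / 8.0
        s = score(omega, phi, lam)
        if best is None or s > best[1]:
            best = (b, s)
    return best
```

Under these assumptions, `lam` close to 0 favors the lowest bitwidth because only the BitOPs term counts, while `lam` close to 1 favors 8 bits because only the entropy change is penalized.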

Abstract

Deploying neural networks on microcontroller units (MCUs) presents substantial challenges due to their constrained computation and memory resources. Previous research has explored patch-based inference as a strategy to conserve memory without sacrificing model accuracy. However, this technique suffers from severe redundant computation overhead, leading to a substantial increase in execution latency. A feasible solution to this issue is mixed-precision quantization, but it faces the challenges of accuracy degradation and a time-consuming search. In this paper, we propose QuantMCU, a novel patch-based inference method that utilizes value-driven mixed-precision quantization to reduce redundant computation. We first utilize value-driven patch classification (VDPC) to maintain the model accuracy. VDPC classifies patches into two classes based on whether they contain outlier values. For patches containing outlier values, we apply 8-bit quantization to the feature maps on the dataflow branches that follow. For patches without outlier values, we utilize value-driven quantization search (VDQS) on the feature maps of their following dataflow branches to reduce search time. Specifically, VDQS introduces a novel quantization search metric that takes into account both computation and accuracy, and it employs entropy as an accuracy representation to avoid additional training. VDQS also adopts an iterative approach to determine the bitwidth of each feature map to further accelerate the search process. Experimental results on real-world MCU devices show that QuantMCU can reduce computation by 2.2x on average while maintaining model accuracy comparable to that of state-of-the-art patch-based inference methods.
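The patch classification step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: an outlier is assumed to be any activation whose absolute value exceeds a fixed `threshold`, patches are assumed to be non-overlapping tiles of a 2D feature map (real patch-based inference typically uses overlapping regions), and the names `split_into_patches`, `classify_patch`, and `plan_precision` are hypothetical.

```python
def split_into_patches(fmap, patch_h, patch_w):
    """Split a 2D feature map (list of rows) into non-overlapping patches, row-major."""
    rows, cols = len(fmap), len(fmap[0])
    patches = []
    for r in range(0, rows, patch_h):
        for c in range(0, cols, patch_w):
            patches.append([row[c:c + patch_w] for row in fmap[r:r + patch_h]])
    return patches


def classify_patch(patch, threshold):
    """Label a patch 'outlier' if any |value| exceeds the (assumed) threshold."""
    has_outlier = any(abs(v) > threshold for row in patch for v in row)
    return "outlier" if has_outlier else "non-outlier"


def plan_precision(fmap, patch_h, patch_w, threshold):
    """Choose a quantization plan per patch.

    Outlier patches keep 8-bit quantization on their dataflow branch;
    non-outlier patches are handed to a mixed-precision search.
    """
    plan = []
    for patch in split_into_patches(fmap, patch_h, patch_w):
        if classify_patch(patch, threshold) == "outlier":
            plan.append("int8")
        else:
            plan.append("mixed")
    return plan


# Example: one large activation in the top-left patch of a 4x4 feature map.
feature_map = [
    [0.1, 0.2, 0.0, 0.3],
    [0.4, 9.0, 0.1, 0.2],
    [0.0, 0.1, 0.2, 0.3],
    [0.3, 0.2, 0.1, 0.0],
]
example_plan = plan_precision(feature_map, 2, 2, threshold=3.0)
```

In this example only the top-left patch contains a value above the threshold, so it alone is kept at 8 bits while the other three patches are left to the mixed-precision search.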

Paper Structure (12 sections, 7 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: (a): A simple demonstration of patch-based inference process (only two patches are drawn). (b): A comparison experiment result of the inference latency of patch-based and layer-based inference.
  • Figure 2: (a): The distribution of the output activation value of the first layer in ResNet18. (b): The separation of outlier value and non-outlier value.
  • Figure 3: A demonstration of VDPC. Patch1 is classified as an outlier class patch since it contains an outlier value at its bottom right corner. Patch2 is classified as a non-outlier class patch since it does not contain any outlier value. For patch1 and dataflow branch1, we apply 8-bit quantization, while for patch2 and dataflow branch2, we apply mixed-precision quantization.
  • Figure 4: The accuracy comparison of QuantMCU with patch-based inference on different networks on two different datasets.
  • Figure 5: Top-1 and Top-5 accuracy of QuantMCU under different $\phi$ values on MobileNetV2 network on the ImageNet dataset.
  • ...and 1 more figure