
MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, Song Han

TL;DR

MCUNetV2 tackles the memory bottleneck of tiny deep learning on microcontrollers by introducing patch-based inference that processes small spatial patches to drastically reduce peak memory. It couples this with receptive-field redistribution and a neural-architecture-search–driven co-design to minimize computation overhead while preserving accuracy. The approach delivers up to eightfold memory reduction and achieves a record ImageNet accuracy of 71.8% on MCU hardware, with Visual Wake Words exceeding 90% accuracy under 32kB SRAM and a 16.9% mAP improvement for Pascal VOC detection. This work broadens the practical capabilities of tinyML vision, enabling high-resolution classification and dense prediction tasks on severely resource-constrained devices.

Abstract

Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.
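To make the memory argument concrete, the following is a minimal sketch (not the authors' implementation) that compares peak activation memory under per-layer and per-patch scheduling for a toy two-layer convolution stack. The `Conv` class, the function names, and the channel counts are hypothetical; the sketch assumes int8 activations (1 byte per element), square feature maps, and "same" padding.

```python
from dataclasses import dataclass


@dataclass
class Conv:
    """Hypothetical description of one convolution layer."""
    kernel: int
    stride: int
    out_channels: int


def ceil_div(a, b):
    return -(-a // b)


def feature_sides(input_side, layers):
    """Side length of the input and of every layer output ("same" padding)."""
    sides = [input_side]
    for conv in layers:
        sides.append(ceil_div(sides[-1], conv.stride))
    return sides


def per_layer_peak(input_side, in_channels, layers):
    """Peak bytes when each layer holds its full input and output at once."""
    sides = feature_sides(input_side, layers)
    chans = [in_channels] + [c.out_channels for c in layers]
    return max(
        sides[i] ** 2 * chans[i] + sides[i + 1] ** 2 * chans[i + 1]
        for i in range(len(layers))
    )


def per_patch_peak(input_side, in_channels, layers, patches_per_side):
    """Peak bytes when the stack is evaluated one output patch at a time.

    Counts a buffer for the full final output plus the largest
    input+output pair of a single patch. The input patch is counted
    too, which is conservative if the image can be decoded partially.
    """
    sides = feature_sides(input_side, layers)
    chans = [in_channels] + [c.out_channels for c in layers]
    # Walk backwards: region each layer needs to produce one output patch.
    patch = [ceil_div(sides[-1], patches_per_side)]
    for conv in reversed(layers):
        patch.append((patch[-1] - 1) * conv.stride + conv.kernel)
    patch.reverse()
    # A patch region can never exceed the full feature map.
    patch = [min(p, s) for p, s in zip(patch, sides)]
    working = max(
        patch[i] ** 2 * chans[i] + patch[i + 1] ** 2 * chans[i + 1]
        for i in range(len(layers))
    )
    output_buffer = sides[-1] ** 2 * chans[-1]
    return output_buffer + working


# Toy stack (assumed values): a stride-1 and a stride-2 3x3 convolution.
TOY_LAYERS = [Conv(kernel=3, stride=1, out_channels=16),
              Conv(kernel=3, stride=2, out_channels=16)]

layer_bytes = per_layer_peak(224, 3, TOY_LAYERS)
patch_bytes = per_patch_peak(224, 3, TOY_LAYERS, patches_per_side=4)
print(f"per-layer: {layer_bytes} B, per-patch: {patch_bytes} B, "
      f"reduction: {layer_bytes / patch_bytes:.2f}x")
```

In this toy setting the full-size output buffer of the stage dominates the per-patch figure, so the achievable reduction depends on how much the patch stage downsamples before handing over to per-layer execution.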

Paper Structure

This paper contains 47 sections, 12 figures, 7 tables.

Figures (12)

  • Figure 1: MobileNetV2 (Sandler et al., 2018) has a very imbalanced memory usage distribution. The peak memory is determined by the first 5 blocks with high peak memory, while the later blocks all share a small memory usage. By using per-patch inference ($4\times4$ patches), we are able to significantly reduce the memory usage of the first 5 blocks, and reduce the overall peak memory by 8$\times$, fitting MCUs with a 256kB memory budget. Notice that the model architecture and accuracy are not changed for the two settings. The memory usage is measured in int8.
  • Figure 2: Detection is more sensitive to smaller resolutions.
  • Figure 3: Per-patch inference can significantly reduce the peak memory required to execute a sequence of convolutional layers. We study two convolutional layers (stride 1 and 2). Under per-layer computation (a), the first convolution has a large input/output activation size, dominating the peak memory requirement. With per-patch computation (b), we allocate the buffer to host the final output activation, and compute the results patch-by-patch. We only need to store the activation from one patch but not the entire feature map, reducing the peak memory (the first input is the image, which can be partially decoded from a compressed format like JPEG).
  • Figure 4: The redistributed MobileNetV2 (MbV2-RD) has reduced receptive field for the per-patch inference stage and increased receptive field for the per-layer stage. The two networks have the same level of performance, but MbV2-RD has a smaller overhead under patch-based inference. The mobile inverted block is denoted as MB{expansion ratio} {kernel size}. The dashed border means stride=2.
  • Figure 5: Analytical profiling: patch-based inference significantly reduces the inference peak memory by 3.7-8.0$\times$ at a small computation overhead of 8-17%. The memory reduction and computation overhead are related to the network design. For MobileNetV2, we can reduce the computation overhead from 10% to 3% by redistributing the receptive field. All networks take an input resolution of $224^2$ and $4\times4$ patches.
  • ...and 7 more figures
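
The overhead figures quoted in the captions above come from overlapping patch borders: each output patch needs a slightly larger input region, and that halo grows with the receptive field of the patch stage. The sketch below is a rough, hypothetical estimate of this effect, not the paper's profiler. Layer tuples, channel counts, and both toy configurations are assumptions, and because it ignores that border patches overlap with padding rather than real pixels, it gives an upper bound.

```python
def patch_output_sides(out_patch, layers):
    """Output side length each layer must produce for one final output patch.

    layers: list of (kernel, stride, in_channels, out_channels), first layer first.
    """
    sides = [out_patch]
    # The input of layer i+1 is the output of layer i, so walk backwards.
    for kernel, stride, _, _ in reversed(layers[1:]):
        sides.append((sides[-1] - 1) * stride + kernel)
    return sides[::-1]


def patch_overhead(input_side, layers, patches_per_side):
    """Fraction of extra MACs under per-patch vs. per-layer execution."""
    # Full output side of every layer, assuming "same" padding.
    full, side = [], input_side
    for _, stride, _, _ in layers:
        side = -(-side // stride)
        full.append(side)
    out_patch = -(-full[-1] // patches_per_side)
    sides = patch_output_sides(out_patch, layers)

    macs_full = macs_patch = 0
    for (kernel, _, cin, cout), f, p in zip(layers, full, sides):
        per_pixel = kernel * kernel * cin * cout  # MACs per output pixel
        macs_full += f * f * per_pixel
        # Every patch recomputes its halo; clip to the full feature map.
        macs_patch += patches_per_side ** 2 * min(p, f) ** 2 * per_pixel
    return macs_patch / macs_full - 1.0


# Two toy patch stages (assumed): large kernels early vs. small kernels early.
WIDE_EARLY = [(3, 2, 3, 16), (5, 1, 16, 16), (5, 2, 16, 24)]
REDISTRIBUTED = [(3, 2, 3, 16), (3, 1, 16, 16), (3, 2, 16, 24)]

for name, cfg in [("wide-early", WIDE_EARLY), ("redistributed", REDISTRIBUTED)]:
    print(f"{name}: {patch_overhead(224, cfg, 4):.1%} extra MACs")
```

Shrinking the kernels inside the patch stage shrinks the halo, which is the intuition behind moving receptive field to the later, per-layer stage; the sketch does not model the accuracy side of that trade-off.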