microYOLO: Towards Single-Shot Object Detection on Microcontrollers

Mark Deutel, Christopher Mutschler, Jürgen Teich

TL;DR

This work investigates the feasibility of single-shot object detection on microcontroller hardware using a compact YOLO-based model called microYOLO. By downsampling input to 128x128, employing depthwise separable convolutions, and using a small SxS grid with a limited number of bounding boxes, the approach enables deployment on Cortex-M7 devices with memory under 800 KB Flash and 350 KB RAM, achieving around 3.5 FPS on the OpenMV H7 R2. The model is trained with pruning and post-training 8-bit quantization and evaluated on three tasks (fridge groceries, humans, and vehicles), revealing varying mAP performance (highest on the fridge task) and providing detailed error analysis. Deployment results and a dedicated C-code pipeline demonstrate practical edge-AI viability on microcontrollers, while highlighting trade-offs between FPS, memory, and detection accuracy for future improvements.

Abstract

This work-in-progress paper presents results on the feasibility of single-shot object detection on microcontrollers using YOLO. Single-shot object detectors like YOLO are widely used; however, due to their complexity, they run mainly on larger GPU-based platforms. We present microYOLO, which can be used on Cortex-M based microcontrollers, such as the OpenMV H7 R2, achieving about 3.5 FPS when classifying 128x128 RGB images while using less than 800 KB Flash and less than 350 KB RAM. Furthermore, we share experimental results for three different object detection tasks and analyze the accuracy of microYOLO on each of them.
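
To put the figures quoted above in perspective, the following sketch works out the size of a single input frame and the time available per frame. It assumes one byte per channel (8-bit RGB) and 1 KB = 1024 bytes, neither of which the abstract states explicitly; the constant and function names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# Assumptions (not stated in the source): 1 byte per channel, 1 KB = 1024 bytes.

WIDTH, HEIGHT, CHANNELS = 128, 128, 3   # 128x128 RGB input
RAM_BUDGET_KB = 350                     # upper bound on RAM quoted in the abstract
FPS = 3.5                               # approximate frame rate on the OpenMV H7 R2


def input_frame_bytes(width, height, channels, bytes_per_channel=1):
    """Size of one uncompressed input frame in bytes."""
    return width * height * channels * bytes_per_channel


frame_bytes = input_frame_bytes(WIDTH, HEIGHT, CHANNELS)
frame_kb = frame_bytes / 1024
ram_share = frame_kb / RAM_BUDGET_KB    # fraction of the RAM bound taken by the input alone
latency_ms = 1000.0 / FPS               # time available per frame at the quoted rate

print(f"input frame: {frame_bytes} bytes ({frame_kb:.0f} KB)")
print(f"share of RAM bound: {ram_share:.1%}")
print(f"time per frame: {latency_ms:.0f} ms")
```

Under these assumptions the input frame alone takes 48 KB, and each frame has roughly 286 ms for capture, inference, and post-processing combined.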

Paper Structure (7 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Architecture of $\mu$YOLO. Each row describes a layer; activation functions (ReLU) and batch normalization layers are omitted for brevity. The tuples describe the convolutional layer ("C") and the depthwise separable convolutional layers ("D") in the form [channels, filters, kernel size, stride, padding], and the linear layers ("L") in the form [num. input, num. output].
  • Figure 2: Average $\mathrm{mAP}^{0.5}$ achieved on the validation data of the considered object detection tasks over the course of 400 training epochs. 3 seeds for each task.
  • Figure 3: Average $\mathrm{mAP}^{0.5}$ achieved on the validation data of the vehicles task given different input image sizes and assuming a maximum of 3 bounding boxes per image. 3 seeds for each task.
  • Figure 4: Samples from the validation sets of the three considered detection tasks, overlaid with all predicted bounding boxes that achieved a confidence greater than 50%.
  • Figure 5: Normalized confusion matrices of predicted bounding boxes versus ground truth. The diagonal entries are correct predictions, while the entries in the upper and lower triangles are errors. A bounding box is correct if its class prediction is correct and both its confidence and its IoU with the ground truth are greater than 50%.
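
The correctness criterion stated in the last caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the corner-coordinate box format `(x_min, y_min, x_max, y_max)` and the names `iou` and `is_correct` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, pred_class, confidence, gt_box, gt_class, threshold=0.5):
    """A prediction counts as correct if the class matches and both the
    confidence and the IoU with the ground truth exceed the threshold."""
    return (
        pred_class == gt_class
        and confidence > threshold
        and iou(pred_box, gt_box) > threshold
    )


# Hypothetical ground-truth box and predictions, for illustration only.
gt_box, gt_class = (10, 10, 60, 60), "car"
print(is_correct((12, 12, 62, 62), "car", 0.8, gt_box, gt_class))    # well aligned
print(is_correct((40, 40, 90, 90), "car", 0.8, gt_box, gt_class))    # poor overlap
print(is_correct((12, 12, 62, 62), "truck", 0.8, gt_box, gt_class))  # wrong class
```

Only the first prediction satisfies all three conditions; the second fails on IoU and the third on class.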