DSORT-MCU: Detecting Small Objects in Real-Time on Microcontroller Units

Liam Boyle, Julian Moosmann, Nicolas Baumann, Seonyeong Heo, Michele Magno

TL;DR

An adaptive tiling method is proposed for lightweight and energy-efficient object detection networks, including You Only Look Once (YOLO)-based models and the popular Faster Objects More Objects (FOMO) network, which enables object detection on low-power microcontroller units (MCUs) with no compromise on accuracy compared to large-scale detection models.

Abstract

Advances in lightweight neural networks have revolutionized computer vision in a broad range of IoT applications, encompassing remote monitoring and process automation. However, the detection of small objects, which is crucial for many of these applications, remains an underexplored area in current computer vision research, particularly for low-power embedded devices that host resource-constrained processors. To address this gap, this paper proposes an adaptive tiling method for lightweight and energy-efficient object detection networks, including YOLO-based models and the popular FOMO network. The proposed tiling enables object detection on low-power MCUs with no compromise on accuracy compared to large-scale detection models. The benefit of the proposed method is demonstrated by applying it to FOMO and TinyissimoYOLO networks on a novel RISC-V-based MCU with built-in ML accelerators. Extensive experimental results show that the proposed tiling method boosts the F1-score by up to 225% for both FOMO and TinyissimoYOLO networks while reducing the average object count error by up to 76% with FOMO and up to 89% for TinyissimoYOLO. Furthermore, the findings of this work indicate that using a soft F1 loss over the popular binary cross-entropy loss can serve as an implicit non-maximum suppression for the FOMO network. To evaluate the real-world performance, the networks are deployed on the RISC-V-based GAP9 microcontroller from GreenWaves Technologies, showcasing the proposed method's ability to strike a balance between detection performance (58% to 95% F1 score), low latency (0.6 ms to 16.2 ms per inference), and energy efficiency (31 µJ to 1.27 mJ per inference) while performing multiple predictions using high-resolution images on an MCU.
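The abstract does not give the exact form of the soft F1 loss, so the following is only a minimal sketch of one common formulation, assuming per-cell probabilities and binary targets. The function name `soft_f1_loss` and the `eps` parameter are illustrative and not taken from the paper.

```python
import numpy as np

def soft_f1_loss(probs, targets, eps=1e-7):
    """One common soft F1 loss (assumed form, not confirmed by the paper).

    probs:   predicted probabilities in [0, 1], any shape
    targets: binary ground truth of the same shape
    """
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # "Soft" counts: probabilities replace hard 0/1 decisions.
    tp = np.sum(probs * targets)
    fp = np.sum(probs * (1.0 - targets))
    fn = np.sum((1.0 - probs) * targets)
    soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - soft_f1

# One object, but three cells fire: the extra activations count as soft false positives.
duplicated = soft_f1_loss([1.0, 1.0, 1.0], [1, 0, 0])
single = soft_f1_loss([1.0, 0.0, 0.0], [1, 0, 0])
```

In this formulation every additional activation around a single object raises the loss, which gives one plausible intuition for why such a loss could act like an implicit non-maximum suppression.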

Paper Structure

This paper contains 15 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the proposed adaptive tiling method for small object detection.
  • Figure 2: Example image split into 12 overlapping tiles. Tiles are extracted from the image and downsampled to the network input resolution. This example demonstrates how objects that are cut off by the tile border (see yellow tile) are fully in view in the neighboring tile because the tiles overlap.
  • Figure 3: Illustration of the matching criterion for predictions made by TinyissimoYOLO. The blue car is only partially visible inside the yellow tile, so the network can only predict a bounding box that covers part of the car. Because neighboring tiles overlap, the same car is fully visible in the purple tile, resulting in a bounding box prediction that covers the whole car. The right side shows the calculation of and intersection ratios for both tiles.
  • Figure 4: Green rectangles indicate true positive predictions, blue rectangles are ground truth bounding boxes that were correctly predicted, red rectangles show false positive predictions, and purple rectangles are ground truth bounding boxes for cars that were not identified. Subfigures a) and b) show the prediction output of the baseline model, which is the default network with and without the fusion method, respectively. Subfigures c) and d) depict the output of our tiling method. In d), training with a soft F1 loss significantly reduces the number of false positives. This graphic is reused from Boyle et al.
  • Figure 5: This figure shows the resulting F1 metrics as well as the latency for different configurations of our implementation of .
  • ...and 2 more figures
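
The overlapping tile layout described in the Figure 2 caption can be sketched as follows. This is a minimal illustration of a fixed grid only. It does not reproduce the paper's adaptive selection of the tiling, and the image size, tile size, and function names (`tile_origins`, `overlapping_tiles`) are hypothetical values chosen to yield 12 tiles.

```python
def tile_origins(length, tile, n):
    """Evenly spaced start offsets for n tiles of size `tile` along one axis."""
    if n * tile < length:
        raise ValueError("tiles are too small to cover the axis")
    if n == 1:
        return [0]
    # Spread the tiles so the first starts at 0 and the last ends at `length`.
    step = (length - tile) / (n - 1)
    return [round(i * step) for i in range(n)]

def overlapping_tiles(width, height, tile_w, tile_h, cols, rows):
    """Return (x0, y0, x1, y1) boxes of a cols x rows grid of overlapping tiles."""
    xs = tile_origins(width, tile_w, cols)
    ys = tile_origins(height, tile_h, rows)
    return [(x, y, x + tile_w, y + tile_h) for y in ys for x in xs]

# Hypothetical example: a 1024x768 image covered by a 4x3 grid of 320x320 tiles.
tiles = overlapping_tiles(1024, 768, 320, 320, cols=4, rows=3)
```

Each tile would then be cropped from the image and downsampled to the network input resolution before inference. Because neighboring tiles share a strip of pixels, an object cut by one tile border can appear whole in the adjacent tile.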