Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs

Davide Nadalini, Manuele Rusci, Elia Cereda, Luca Benini, Francesco Conti, Daniele Palossi

TL;DR

The paper tackles domain shift in monocular depth estimation (MDE) on ultra-low-power MCUs with a multi-modal On-Device Learning (ODL) approach. It combines in-field data collection, pseudo-labels from a small 8x8 depth sensor, and a memory-driven sparse update to fine-tune a tiny μPyD-Net model directly on the GAP9 MCU. Fine-tuning completes in about 17.8 minutes using roughly 3k samples and reduces RMSE from 4.9 m to around 0.6 m on real-world IDSIA data. The sparse update fits fine-tuning into 1.2 MB of memory with accuracy drops of only 2% and 1.5% on the public KITTI and NYUv2 benchmarks. The work demonstrates that MDE adaptation can run fully on-device on IoT nodes, without bulky or power-hungry peripherals, and the authors release datasets and code to support further research.
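To make the RMSE figures concrete, here is a minimal sketch of how a dense depth prediction could be scored against an 8x8 pseudo-label. It is an illustration under assumptions, not the authors' evaluation code: the average pooling, the convention that a zone value of 0.0 marks an invalid reading, and the names `average_pool` and `masked_rmse` are all hypothetical.

```python
import math


def average_pool(depth, grid_h=8, grid_w=8):
    """Average-pool a dense depth map (list of rows, in meters) to a coarse grid.

    Assumption: the image height and width are multiples of the grid size.
    """
    h, w = len(depth), len(depth[0])
    if h % grid_h or w % grid_w:
        raise ValueError("image size must be a multiple of the grid size")
    bh, bw = h // grid_h, w // grid_w
    pooled = []
    for gy in range(grid_h):
        row = []
        for gx in range(grid_w):
            block = [depth[y][x]
                     for y in range(gy * bh, (gy + 1) * bh)
                     for x in range(gx * bw, (gx + 1) * bw)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def masked_rmse(pred, label, invalid=0.0):
    """RMSE over grid zones, skipping zones whose label equals `invalid`.

    Assumption: the sensor reports `invalid` for zones without a usable reading.
    """
    squared = [(p - t) ** 2
               for pred_row, label_row in zip(pred, label)
               for p, t in zip(pred_row, label_row)
               if t != invalid]
    if not squared:
        raise ValueError("no valid zones in the pseudo-label")
    return math.sqrt(sum(squared) / len(squared))


# Toy data: a 16x16 prediction of 2.0 m everywhere, a sensor reading of 2.5 m,
# and one zone flagged as invalid.
prediction = [[2.0] * 16 for _ in range(16)]
pseudo_label = [[2.5] * 8 for _ in range(8)]
pseudo_label[0][0] = 0.0
rmse = masked_rmse(average_pool(prediction), pseudo_label)
```

In this toy case every valid zone is off by 0.5 m, so the RMSE is 0.5 m. The same masked error could in principle serve as a training loss, although the loss the paper actually uses is not stated in this summary.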

Abstract

Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications in Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera and an 8 x 8 pixel depth sensor, consuming $\approx$300 mW. In its normal operation, this setup feeds a tiny 107 k-parameter $\mu$PyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 to 0.6 m with only 3 k self-labeled samples, collected in a real-life deployment scenario.
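The idea of a memory-driven sparse update can be sketched as a budgeted choice of which layers to train. The selection rule below is an assumption, not the paper's scheme: it trains a contiguous tail of layers, growing from the output toward the input until a memory budget is exhausted. The cost model (weight-gradient bytes plus saved-activation bytes per layer), the layer names, and the byte counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    weight_grad_bytes: int  # buffer for this layer's weight gradients
    activation_bytes: int   # input activation kept for the backward pass


def select_trainable_layers(layers, budget_bytes):
    """Pick the layers to update, given layers ordered from input to output.

    Backpropagation starts at the output, so the trainable set is grown
    backwards from the last layer and stops at the first layer that would
    exceed the budget. Earlier layers stay frozen and need no training memory.
    """
    selected = []
    used = 0
    for layer in reversed(layers):
        cost = layer.weight_grad_bytes + layer.activation_bytes
        if used + cost > budget_bytes:
            break
        used += cost
        selected.append(layer.name)
    selected.reverse()  # report in input-to-output order
    return selected, used


# Hypothetical four-layer network with made-up memory costs (bytes).
network = [
    Layer("enc1", 100, 900),
    Layer("enc2", 200, 500),
    Layer("dec1", 300, 300),
    Layer("dec2", 100, 200),
]
trainable, used_bytes = select_trainable_layers(network, budget_bytes=1000)
```

With a 1000-byte budget, the sketch keeps `dec1` and `dec2` trainable (900 bytes) and freezes both encoder layers. Early layers tend to hold the largest activations, which is one reason a tail-only update can save a large share of training memory.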

Paper Structure

This paper contains 24 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: GAP9Shield IoT node [muller2024gap9shield] mounted on a Crazyflie 2.1 (left) and the characteristics of the main components (table on the right).
  • Figure 2: $\mu$PyD-Net [peluso2021monocular] DNN architecture for monocular depth estimation.
  • Figure 3: Diagram of the main phases of our ODL method for MDE. (a) The IoT node continuously runs the inference-only task onboard and predicts the depth maps from single-camera images. In ODL mode, the system collects new data from the camera and the depth sensors (b) and then performs training on-device (c). In every phase, system components colored in white are not active.
  • Figure 4: Layer-wise breakdown of $\mu$PyD-Net memory cost, contributing to the total fine-tuning memory.
  • Figure 5: Images, ground truths and $\mu$PyD-Net predictions.
  • ...and 4 more figures