Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs

Davide Nadalini; Manuele Rusci; Elia Cereda; Luca Benini; Francesco Conti; Daniele Palossi

TL;DR

The paper tackles domain shift in monocular depth estimation (MDE) on ultra-low-power MCUs with a multi-modal On-Device Learning (ODL) approach. Images collected on the device are paired with pseudo-labels from a small 8 x 8 depth sensor, and a memory-driven sparse update fine-tunes a tiny μPyD-Net model directly on the GAP9 MCU. The sparse update needs 1.2 MB of fine-tuning memory, 2.2x less than a full update, at the cost of only 2% and 1.5% accuracy drops on the KITTI and NYUv2 benchmarks. In the field, training completes in about 17.8 minutes with roughly 3k self-labeled samples and reduces the RMSE from 4.9 m to around 0.6 m on real-world IDSIA data. The work demonstrates that fully on-device MDE adaptation is feasible for IoT deployments, enabling autonomous, energy-efficient depth sensing without bulky or power-hungry peripherals, and it releases datasets and code to support further research.

Abstract

Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications in Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a GreenWaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera and an 8 x 8 pixel depth sensor, consuming $\approx$300 mW. In its normal operation, this setup feeds a tiny 107 k-parameter $\mu$PyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 m to 0.6 m with only 3 k self-labeled samples, collected in a real-life deployment scenario.
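
The abstract reports accuracy as a root mean squared error against pseudo-labels from the 8 x 8 depth sensor, but it does not say how a dense prediction is compared with such a coarse label. The following is only a minimal sketch of one plausible way to do it, not the authors' method: it assumes the predicted map is average-pooled onto the 8 x 8 grid and that sensor cells without a valid reading (encoded here as `None` or a non-positive value) are skipped. The function names `block_average` and `rmse_vs_pseudo_label`, the pooling strategy, and the invalid-cell encoding are all hypothetical.

```python
import math


def block_average(depth, grid=8):
    """Average-pool a dense H x W depth map (list of rows) to grid x grid cells.

    Assumption: H and W are exact multiples of `grid`.
    """
    h, w = len(depth), len(depth[0])
    if h % grid or w % grid:
        raise ValueError("map size must be a multiple of the grid size")
    bh, bw = h // grid, w // grid
    pooled = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            total = 0.0
            for y in range(gy * bh, (gy + 1) * bh):
                for x in range(gx * bw, (gx + 1) * bw):
                    total += depth[y][x]
            row.append(total / (bh * bw))
        pooled.append(row)
    return pooled


def rmse_vs_pseudo_label(pred, label, grid=8):
    """RMSE (same unit as the inputs, e.g. metres) between the pooled
    prediction and a grid x grid pseudo-label.

    Assumption: label cells that are None or <= 0 carry no valid reading
    and are excluded from the error.
    """
    pooled = block_average(pred, grid)
    squared_error, valid_cells = 0.0, 0
    for gy in range(grid):
        for gx in range(grid):
            target = label[gy][gx]
            if target is None or target <= 0:
                continue
            squared_error += (pooled[gy][gx] - target) ** 2
            valid_cells += 1
    if valid_cells == 0:
        raise ValueError("pseudo-label has no valid cells")
    return math.sqrt(squared_error / valid_cells)


# Toy example: a flat 16 x 16 prediction at 2.0 m against a flat 1.5 m label.
pred = [[2.0] * 16 for _ in range(16)]
label = [[1.5] * 8 for _ in range(8)]
error = rmse_vs_pseudo_label(pred, label)
```

In this toy case every cell is off by 0.5 m, so `error` is 0.5. Pooling the prediction down to the sensor grid, rather than upsampling the label, keeps the comparison at the resolution where ground truth actually exists; other choices (for example sampling cell centres) would be equally defensible under the information given here.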
