Fusing Multi-sensor Input with State Information on TinyML Brains for Autonomous Nano-drones

Luca Crupi, Elia Cereda, Daniele Palossi

TL;DR

This work tackles the sense-and-act limitations of ultra-low-power TinyML nano-drones by embedding the drone's state information into a lightweight CNN for allocentric human pose estimation. The network takes multi-sensor input (grayscale $160\times96$ images and an $8\times8$ depth map) and predicts $x$ and $y$. It extends a State-of-the-Art (SoA) baseline CNN with four state-aware fusion schemes (input, mid, late direct, and late with MLP) that incorporate the attitude angles ($\varphi$, $\theta$), represented either as 2D state maps or as a 2-element vector. Training is done entirely in simulation with domain randomization, and evaluation uses a real-world set of $\sim$3.5k samples. The ablation study shows consistent $R^2$ gains when state information is used: the best late-fusion direct configuration improves $R^2$ by up to $+0.10$ on $x$ and $+0.01$ on $y$, with negligible overhead (around 0.11% more MACs and minimal change in memory). Overall, the paper demonstrates a practical path to better allocentric perception on TinyML platforms, enabling more capable autonomous nano-drones without significant energy or compute penalties, with a model trained in simulation and validated on real-world data.
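The two ways of representing the state mentioned above (2D state maps versus a 2-element vector) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`state_maps`, `input_fusion`, `late_fusion_direct`), the channel ordering, the feature-vector length, and the weights are hypothetical placeholders. The sketch also assumes that a state map is a constant image filled with the angle value, and that direct late fusion appends the state vector to the features feeding the final linear layer. Mid fusion and late fusion with MLP are not shown.

```python
import numpy as np


def state_maps(pitch, roll, height, width):
    """Assumed encoding: one constant 2D map per attitude angle."""
    return np.stack([
        np.full((height, width), pitch, dtype=np.float32),
        np.full((height, width), roll, dtype=np.float32),
    ])


def input_fusion(image, pitch, roll):
    """Stack a grayscale image (H, W) with the state maps along a channel axis."""
    maps = state_maps(pitch, roll, *image.shape)
    return np.concatenate([image[None].astype(np.float32), maps], axis=0)


def late_fusion_direct(features, pitch, roll, weights, bias):
    """Append the 2-element state vector to the backbone features,
    then apply a single linear layer producing the two outputs (x, y)."""
    state = np.array([pitch, roll], dtype=np.float32)
    fused = np.concatenate([features, state])
    return weights @ fused + bias
```

For a $160\times96$ image stored as an array of shape `(96, 160)`, `input_fusion` returns an array of shape `(3, 96, 160)`. With direct late fusion, the final layer only needs two extra input columns in `weights`, which is consistent with the small overhead reported for that configuration.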

Abstract

Autonomous nano-drones (~10 cm in diameter), thanks to their ultra-low-power TinyML-based brains, are capable of coping with real-world environments. However, due to their simplified sensors and compute units, they are still far from the sense-and-act capabilities shown by their bigger counterparts. This system paper presents a novel deep learning-based pipeline that fuses multi-sensor input (i.e., low-resolution images and an $8\times8$ depth map) with the robot's state information to tackle a human pose estimation task. Thanks to our design, the proposed system, trained in simulation and tested on a real-world dataset, improves on a state-unaware State-of-the-Art baseline, increasing the $R^2$ regression metric by up to 0.10 on the distance prediction.

Paper Structure (4 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: A) Our CNN inputs. B) CNN architecture exploration, based on the SoA vision+depth backbone from crupi2023sim. Our proposed state-aware models use either 1) input fusion, 2) mid fusion, 3) late fusion (direct), or 4) late fusion with MLP.
  • Figure 2: $R^2$ comparison of our fusion techniques (i.e., input, mid, direct, and MLP), with and without dropout and with different states (i.e., pitch, roll, and pitch+roll), vs. the state-unaware SoA baseline crupi2023sim. Each marker is the average of 5 training runs.
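
For readers unfamiliar with the metric in Figure 2, the following is a minimal sketch of the standard coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, averaged over several training runs. It assumes each marker is the plain mean of the per-run $R^2$ values on a shared test set. The function names (`r2_score`, `mean_r2_over_runs`) are illustrative and not taken from the paper.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


def mean_r2_over_runs(y_true, predictions_per_run):
    """Average the R^2 obtained by each training run on the same test targets."""
    return float(np.mean([r2_score(y_true, p) for p in predictions_per_run]))
```

A model that always predicts the mean of the targets scores $R^2 = 0$ and a perfect model scores $R^2 = 1$, so a gain of $+0.10$ is a change on that scale.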