Vision-State Fusion: Improving Deep Neural Networks for Autonomous Robotics

Elia Cereda; Stefano Bonato; Mirko Nava; Alessandro Giusti; Daniele Palossi


TL;DR

This work tackles non-egocentric vision-based 3D pose estimation in robotics by introducing vision-state fusion, i.e., feeding the robot's onboard state to a CNN as an auxiliary input. It analyzes four fusion approaches, identifies an MLP-based state-feature branch as the best trade-off, and applies the method to three diverse use cases, arm-to-object (A2O), drone-to-drone (D2D), and drone-to-human (D2H), in simulation and in field tests. Results show substantial improvements in regression accuracy: an $R^2$ increase of up to 0.51 over stateless baselines and, in field tests, an average $MAE$ reduction of 24%, validating the approach across hardware-constrained platforms. The findings demonstrate that incorporating the robot's state improves the interpretation of visual data about external targets, enabling more reliable and robust autonomous perception in real-world robotics scenarios.
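As a rough illustration of what an MLP-based state-feature branch could look like, the sketch below passes a state vector through a small MLP, concatenates the result with visual features, and feeds the fused vector to a regression head. This is a minimal sketch under stated assumptions, not the architecture from the paper: the fusion point, the layer sizes, the contents of the state vector, the random weights, and all function names (`linear`, `random_layer`, `fuse_and_regress`) are hypothetical, and the visual features stand in for the output of a CNN backbone.

```python
import random

def linear(x, weights, bias):
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def random_layer(rng, n_in, n_out):
    """Hypothetical untrained layer: small random weights, zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse_and_regress(visual_features, state, params):
    """Fuse CNN features with MLP-processed state, then regress the output."""
    # State branch: a one-layer MLP turns the raw state into state features.
    w_state, b_state = params["state_mlp"]
    state_features = relu(linear(state, w_state, b_state))
    # Fusion (assumed here to be plain concatenation of the two vectors).
    fused = visual_features + state_features
    # Regression head: maps the fused vector to the estimated pose variables.
    w_head, b_head = params["head"]
    return linear(fused, w_head, b_head)

# Hypothetical sizes: 16 visual features, 3 state values, 4 output variables.
N_VISUAL, N_STATE, N_STATE_FEAT, N_OUT = 16, 3, 8, 4
rng = random.Random(0)
params = {
    "state_mlp": random_layer(rng, N_STATE, N_STATE_FEAT),
    "head": random_layer(rng, N_VISUAL + N_STATE_FEAT, N_OUT),
}
visual_features = [rng.random() for _ in range(N_VISUAL)]  # stand-in for CNN output
state = [0.1, -0.2, 0.05]                                  # made-up state vector
pose = fuse_and_regress(visual_features, state, params)
print(len(pose))
```

In a trained network the state branch and the head would be learned jointly with the CNN; here the weights are random, so only the data flow is meaningful.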

Abstract

Vision-based deep learning perception fulfills a paramount role in robotics, facilitating solutions to many challenging scenarios, such as acrobatic maneuvers of autonomous unmanned aerial vehicles (UAVs) and robot-assisted high-precision surgery. Control-oriented end-to-end perception approaches, which directly output control variables for the robot, commonly take advantage of the robot's state estimation as an auxiliary input. In mediated approaches, i.e., when intermediate outputs are estimated and fed to a lower-level controller, the robot's state is commonly used as an input only for egocentric tasks, which estimate physical properties of the robot itself. In this work, we propose to apply a similar approach, for the first time to the best of our knowledge, to non-egocentric mediated tasks, where the estimated outputs refer to an external subject. We prove how our general methodology improves the regression performance of deep convolutional neural networks (CNNs) on a broad class of non-egocentric 3D pose estimation problems, with minimal computational cost. Across three highly different use cases, spanning from grasping with a robotic arm to following a human subject with a pocket-sized UAV, our results consistently improve the $R^2$ regression metric, by up to +0.51, compared to the stateless baselines. Finally, we validate the in-field performance of a closed-loop autonomous cm-scale UAV on the human pose estimation task. Our results show a significant reduction, i.e., 24% on average, in the mean absolute error of our stateful CNN, compared to a State-of-the-Art stateless counterpart.
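The two metrics quoted above, $R^2$ and mean absolute error, can be computed as in the following sketch. It uses the standard definitions, $R^2 = 1 - SS_{res}/SS_{tot}$ and the mean of absolute differences. The ground-truth values and the "stateless" and "stateful" predictions are made-up numbers chosen for illustration only; they are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values for one output variable (illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]
stateless_pred = [1.5, 1.5, 3.5, 3.5]  # hypothetical baseline predictions
stateful_pred = [1.1, 1.9, 3.1, 3.9]   # hypothetical state-aware predictions

r2_gain = r2_score(y_true, stateful_pred) - r2_score(y_true, stateless_pred)
mae_reduction = 1.0 - (mean_absolute_error(y_true, stateful_pred)
                       / mean_absolute_error(y_true, stateless_pred))
print(f"R2 gain: {r2_gain:+.3f}, relative MAE reduction: {mae_reduction:.0%}")
```

An absolute $R^2$ gain (such as the +0.51 above) and a relative MAE reduction (such as the 24%) are different kinds of quantity, which is why the sketch computes the first as a difference and the second as a ratio.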

Paper Structure (17 sections, 2 equations, 10 figures, 1 table)

INTRODUCTION
RELATED WORK
USE CASES, MODELS, AND DEPLOYMENT
Robot arm-to-object: A2O
Drone-to-Drone: D2D
Drone-to-Human: D2H
Vision-state fusion techniques
Proposed CNN architectures
In-field deployment: D2H
Training and hyper-parameters
EXPERIMENTAL RESULTS
Regression performance: A2O
Regression performance: D2D
Regression performance: D2H
In-field experimental results: D2H
...and 2 more sections

Figures (10)

Figure 1: Robotics system architecture with proposed auxiliary state input to a non-egocentric perception CNN.
Figure 2: Our 3D pose estimation use cases: A) arm-to-object (simulation), B) drone-to-drone (on-board view), and C) drone-to-human (in-field test).
Figure 3: Reference frames in the D2H use case.
Figure 4: Individual photometric data augmentations (top). Ten images produced by the full augmentation pipeline (bottom).
Figure 5: Proposed stateful CNN architectures extended with a multi-layer perceptron branch (MLP). A) A2O use case: MobileNetV2-based CNN, with details of the repeated bottleneck residual blocks and bottleneck blocks. B) D2D and D2H use cases: PULP-Frontnet-based CNN.
...and 5 more figures
