Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents

Mihaela-Larisa Clement; Mónika Farsang; Felix Resch; Mihai-Teodor Stanusoiu; Radu Grosu

TL;DR

This work tackles the sim-to-real gap and sensor noise in autonomous navigation by combining RGB with depth (RGB-D) in lightweight recurrent controllers deployed on a small-scale roboracer. It compares three approaches: early fusion, late fusion, and depth-aware deformable CNNs. The models are trained on a 100k-frame open-loop dataset collected indoors with a RealSense RGB-D camera and a human expert driver. The key finding is that early-fusion RGB-D yields the most robust closed-loop performance: the controller keeps driving safely and avoiding obstacles under frame drops and noise, and it transfers to unseen intersections and dynamic obstacles. The study demonstrates practical depth-enabled perception on resource-constrained hardware and informs design choices for multimodal perception on low-cost platforms.

Abstract

Autonomous agents that rely purely on perception to make real-time control decisions require efficient and robust architectures. In this work, we demonstrate that augmenting RGB input with depth information significantly enhances our agents' ability to predict steering commands compared to using RGB alone. We benchmark lightweight recurrent controllers that leverage the fused RGB-D features for sequential decision-making. To train our models, we collect high-quality data using a small-scale autonomous car controlled by an expert driver via a physical steering wheel, capturing varying levels of steering difficulty. Our models were successfully deployed on real hardware and inherently avoided both dynamic and static obstacles under out-of-distribution conditions. Specifically, our findings reveal that early fusion of depth data results in a highly robust controller, which remains effective even with frame drops and increased noise levels, without compromising the network's focus on the task.
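
To make the distinction between early and late fusion concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes images are nested lists in height x width x channel layout, and the `mean_pool` helper is a hypothetical stand-in for a learned encoder. Early fusion stacks depth as a fourth input channel so that a single encoder sees RGB-D jointly; late fusion encodes each modality separately and concatenates the resulting feature vectors.

```python
from typing import List

Image = List[List[List[float]]]  # H x W x C (assumed layout)
DepthMap = List[List[float]]     # H x W


def early_fusion(rgb: Image, depth: DepthMap) -> Image:
    """Stack depth as a fourth channel, giving an H x W x 4 RGB-D input."""
    if len(rgb) != len(depth) or any(
        len(rgb_row) != len(depth_row) for rgb_row, depth_row in zip(rgb, depth)
    ):
        raise ValueError("RGB and depth must share the same spatial size")
    return [
        [pixel + [d] for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


def mean_pool(image: Image) -> List[float]:
    """Hypothetical toy encoder: per-channel mean over all pixels."""
    channels = len(image[0][0])
    n_pixels = len(image) * len(image[0])
    return [
        sum(pixel[c] for row in image for pixel in row) / n_pixels
        for c in range(channels)
    ]


def late_fusion(rgb: Image, depth: DepthMap) -> List[float]:
    """Encode each modality separately, then concatenate the features."""
    depth_image = [[[d] for d in row] for row in depth]  # H x W x 1
    return mean_pool(rgb) + mean_pool(depth_image)


# A 1 x 2 RGB frame with a matching depth map.
rgb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
depth = [[1.0, 2.0]]

fused_input = early_fusion(rgb, depth)       # each pixel now has 4 channels
fused_features = late_fusion(rgb, depth)     # 3 RGB features + 1 depth feature
```

With this toy encoder the two routes carry the same information; the difference only matters once the encoders are learned, because an early-fusion encoder can relate colour and depth at every pixel from its first layer, while a late-fusion model combines them only after each branch has summarised its own modality.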
