Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents

Mihaela-Larisa Clement, Mónika Farsang, Felix Resch, Mihai-Teodor Stanusoiu, Radu Grosu

TL;DR

This work tackles the sim-to-real gap and sensor-noise challenges in autonomous navigation by combining RGB with depth (RGB-D) in lightweight recurrent controllers deployed on a small-scale roboracer. It compares early fusion, late fusion, and depth-aware deformable CNN approaches, using a 100k-frame open-loop dataset collected indoors with a RealSense RGB-D camera and a human expert driver. The key finding is that early-fusion RGB-D yields the most robust closed-loop performance: it maintains safe, obstacle-avoiding behavior under frame drops and noise, and it transfers to unseen intersections and dynamic obstacles. The study demonstrates practical depth-enabled perception on resource-constrained hardware, offering a path toward robust real-world navigation on low-cost platforms and informing design choices for multimodal perception in constrained settings.

Abstract

Autonomous agents that rely purely on perception to make real-time control decisions require efficient and robust architectures. In this work, we demonstrate that augmenting RGB input with depth information significantly enhances our agents' ability to predict steering commands compared to using RGB alone. We benchmark lightweight recurrent controllers that leverage the fused RGB-D features for sequential decision-making. To train our models, we collect high-quality data using a small-scale autonomous car controlled by an expert driver via a physical steering wheel, capturing varying levels of steering difficulty. Our models were successfully deployed on real hardware and inherently avoided both dynamic and static obstacles under out-of-distribution conditions. Specifically, our findings reveal that the early fusion of depth data results in a highly robust controller, which remains effective even with frame drops and increased noise levels, without compromising the network's focus on the task.

Paper Structure

This paper contains 22 sections, 14 figures, 1 table.

Figures (14)

  • Figure 1: High-level overview of the model architectures.
  • Figure 2: Tracks built for human expert data collection.
  • Figure 3: Driving stack overview for data collection step.
  • Figure 4: The sequence of layers in the convolutional heads is as follows. The Convolutional Feature Extractor is used with an RGB input of size 3×120×212. In the early fusion (EARLY) method, it processes an input of size 4×120×212. In the late fusion (LATE) method, it is used twice, treating RGB and depth as two separate input streams of sizes 3×120×212 and 1×120×212, respectively, and concatenating their features after the last layer. The Depth-Adapted Convolutional Feature Extractor has two versions: (1) the DCN, which includes a Convolutional Offset Extractor, and (2) the ZACN (Wu et al., 2020), which incorporates Geometric Offset Computation. In the hyperparameters section: F denotes the number of filters, K the kernel size, S the stride, and P the padding.
  • Figure 5: Extraction of a map used during closed-loop active testing. Problematic turns are marked with an 'X' to indicate crashes. Labels specify which model crashed at each point.
  • ...and 9 more figures
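
The input sizes given in the Figure 4 caption can be traced with a small shape-bookkeeping sketch. It uses the standard convolution output-size formula and contrasts the early-fusion input (RGB and depth stacked to 4 channels) with late fusion (two separate streams whose features are concatenated after the last layer). The layer list `LAYERS` is hypothetical, since the caption does not give values for F, K, S, and P. Concatenating the late-fusion features along the channel axis is also an assumption; the paper may flatten the features first.

```python
# Shape bookkeeping for early vs. late RGB-D fusion (illustrative sketch only).

def conv2d_out_hw(h, w, k, s, p):
    """Spatial size after one conv layer: floor((n + 2P - K) / S) + 1 per axis."""
    return (h + 2 * p - k) // s + 1, (w + 2 * p - k) // s + 1


def feature_shape(in_shape, layers):
    """Propagate a (C, H, W) shape through a list of (F, K, S, P) conv layers."""
    _, h, w = in_shape
    c = in_shape[0]
    for f, k, s, p in layers:
        h, w = conv2d_out_hw(h, w, k, s, p)
        c = f  # the number of filters becomes the next channel count
    return c, h, w


# Hypothetical (F, K, S, P) values; NOT taken from the paper.
LAYERS = [(16, 5, 2, 2), (32, 3, 2, 1), (64, 3, 2, 1)]

RGB = (3, 120, 212)    # input size stated in the Figure 4 caption
DEPTH = (1, 120, 212)  # input size stated in the Figure 4 caption


def early_fusion_shape(rgb, depth, layers):
    """Stack RGB and depth channels first, then run a single extractor."""
    assert rgb[1:] == depth[1:], "RGB and depth must share spatial size"
    fused_in = (rgb[0] + depth[0], rgb[1], rgb[2])
    return fused_in, feature_shape(fused_in, layers)


def late_fusion_shape(rgb, depth, layers):
    """Run the extractor on each stream, then concatenate the features.

    Channel-wise concatenation is an assumption of this sketch.
    """
    f_rgb = feature_shape(rgb, layers)
    f_depth = feature_shape(depth, layers)
    assert f_rgb[1:] == f_depth[1:]
    return (f_rgb[0] + f_depth[0], f_rgb[1], f_rgb[2])


if __name__ == "__main__":
    print("early:", early_fusion_shape(RGB, DEPTH, LAYERS))
    print("late: ", late_fusion_shape(RGB, DEPTH, LAYERS))
```

With these assumed layers, early fusion keeps a single feature map, while late fusion doubles the channel count passed to the recurrent controller. This is one reason the two designs can differ in compute cost on constrained hardware.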