Radar Meets Vision: Robustifying Monocular Metric Depth Prediction for Mobile Robotics

Marco Job; Thomas Stastny; Tim Kazik; Roland Siegwart; Michael Pantic

TL;DR

This work encodes measurements from a low-cost mmWave radar into the input space of a state-of-the-art monocular depth estimation model, and introduces a novel methodology for synthesizing rendered, realistic learning datasets based on photogrammetric data that simulate the radar sensor observations for training.

Abstract

Mobile robots require accurate and robust depth measurements to understand and interact with the environment. While existing sensing modalities address this problem to some extent, recent research on monocular depth estimation has leveraged the information richness, yet low cost and simplicity of monocular cameras. These works have shown significant generalization capabilities, mainly in automotive and indoor settings. However, robots often operate in environments with limited scale cues, self-similar appearances, and low texture. In this work, we encode measurements from a low-cost mmWave radar into the input space of a state-of-the-art monocular depth estimation model. Despite the radar's extreme point cloud sparsity, our method demonstrates generalization and robustness across industrial and outdoor experiments. Our approach reduces the absolute relative error of depth predictions by 9-64% across a range of unseen, real-world validation datasets. Importantly, we maintain consistency of all performance metrics across all experiments and scene depths where current vision-only approaches fail. We further address the present deficit of training data in mobile robotics environments by introducing a novel methodology for synthesizing rendered, realistic learning datasets based on photogrammetric data that simulate the radar sensor observations for training. Our code, datasets, and pre-trained networks are made available at https://github.com/ethz-asl/radarmeetsvision.
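The headline metric in the abstract, absolute relative error, is commonly defined as the mean of |predicted depth - ground-truth depth| / ground-truth depth over pixels with valid ground truth. The sketch below illustrates that common definition and how a relative reduction such as the reported 9-64% could be computed. The function names, the validity threshold `min_depth`, and all numbers in the usage lines are illustrative assumptions, not values or code from the paper or its repository.

```python
from typing import Optional, Sequence


def abs_rel_error(
    pred: Sequence[float],
    gt: Sequence[Optional[float]],
    min_depth: float = 1e-3,  # assumed threshold for "valid" ground truth
) -> float:
    """Mean of |pred - gt| / gt over pixels with valid ground-truth depth."""
    total = 0.0
    count = 0
    for p, g in zip(pred, gt):
        # Skip pixels without a usable ground-truth measurement.
        if g is None or g <= min_depth:
            continue
        total += abs(p - g) / g
        count += 1
    if count == 0:
        raise ValueError("no valid ground-truth pixels")
    return total / count


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical flattened depth maps in meters; 0.0 marks missing ground truth.
error = abs_rel_error([1.1, 2.0, 5.0], [1.0, 4.0, 0.0])
# Hypothetical errors of a baseline and an improved model.
reduction = relative_reduction(0.20, 0.10)
```

Under this definition a lower value is better, and the error is normalized by the true depth, so a 1 m mistake at 100 m counts less than a 1 m mistake at 5 m.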

Paper Structure

This paper contains 10 sections, 3 equations, 6 figures, and 3 tables.

Introduction
Related Work
Method
Architecture
Training Datasets
Validation Datasets
Radar Image Projection
Experimental Design
Results
Conclusion

Figures (6)

Figure 1: Top row: 3D rendering of the Rhône glacier in Switzerland, one of the validation sites. Middle row: RGB input image to the network, overlaid with the sparse radar observation in red (left), and the metric depth prediction of our approach (right). Bottom row: Absolute relative error with respect to LiDAR ground truth for our approach (left) and for the work of depthanythingv2 (right).
Figure 2: Absolute relative error reported in the works depthanythingv2, bhat2023zoedepthzeroshottransfercombining, depthanything, and li2024radarcamdepthradarcamerafusiondepth, divided into the categories 'Automotive', 'Indoor', and 'Outdoor'. Results in the 'Automotive' category are clearly better than in the other categories, especially 'Outdoor'.
Figure 3: Overview of the inference and training architecture of our approach. We extend the input space of the architecture to $640\times 480 \times 4$; the additional channel encodes the sparse radar depth (SD). We extend the network that creates the embeddings for the vision transformer to support the additional channel. The output head is also extended by an additional channel, and we obtain the metric depth prediction described in Eq. \ref{eq:depth_weight}. All green components are trained at a high learning rate, whereas blue components are only fine-tuned.
Figure 4: An overview of the four generated training datasets: From top to bottom row, we show samples of the road area, the Rhône glacier, the rural farming area, and the mountainous area.
Figure 5: Handheld sensor rig, consisting of a TI mmWave AWR1843AOPEVM radar and a FLIR FFY-U3-16S2C-S camera with a 3.6 mm lens. Red arrows represent the x-axis, green arrows represent the y-axis, and blue arrows represent the z-axis. The data is recorded and processed by an NVIDIA Jetson Xavier NX.
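The caption of Figure 3 describes a fourth input channel that encodes the sparse radar depth alongside the RGB image. One plausible way to build such a channel is to project radar points into the image plane with a pinhole camera model, sketched below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the points are already expressed in the camera frame (x right, y down, z forward), ignores lens distortion, stores the z-coordinate as depth, uses 0.0 for pixels without a radar return, and uses made-up intrinsics. The function name `radar_to_depth_channel` is hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the camera frame, meters


def radar_to_depth_channel(
    points_cam: Iterable[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int = 640,
    height: int = 480,
) -> List[List[float]]:
    """Rasterize sparse 3D points into a height x width depth image."""
    # 0.0 marks pixels without any radar return (an assumed convention).
    channel = [[0.0] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point lies behind the camera
        # Pinhole projection to pixel coordinates.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # point projects outside the image
        # If several points hit the same pixel, keep the nearest one.
        if channel[v][u] == 0.0 or z < channel[v][u]:
            channel[v][u] = z
    return channel


# Hypothetical intrinsics and radar points, for illustration only.
points = [(0.0, 0.0, 10.0), (0.0, 0.0, 4.0), (1.0, 0.0, -5.0), (100.0, 0.0, 1.0)]
sparse_depth = radar_to_depth_channel(points, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

With only a handful of radar returns per frame, almost every entry of the resulting channel stays at the "no return" value, which illustrates the extreme sparsity noted in the abstract.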
