TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion

Yiran Wang, Jiaqi Li, Chaoyi Hong, Ruibo Li, Liusheng Sun, Xiao Song, Zhe Wang, Zhiguo Cao, Guosheng Lin

TL;DR

TacoDepth tackles dense, metric Radar-Camera depth estimation under Radar sparsity and noise by introducing a one-stage fusion framework. It combines a graph-based Radar structure extractor with a pyramid-based Radar fusion module, augmented by Radar-centered flash attention to tightly couple Radar and image features without intermediate depth predictions. The method supports both independent and plug-in inference, achieving real-time performance (over 37 fps in independent mode) and substantial accuracy gains over prior multi-stage approaches and RadarCam-Depth with different predictors. Across nuScenes and ZJU-4DRadarCam, TacoDepth delivers consistent improvements in MAE, RMSE, and efficiency, and ablations confirm the effectiveness of the graph extraction and pyramid fusion design for robust Radar-Camera depth estimation in diverse conditions.

Abstract

Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming and not robust. To address these challenges, we propose TacoDepth, an efficient and accurate Radar-Camera depth estimation model with one-stage fusion. Specifically, the graph-based Radar structure extractor and the pyramid-based Radar fusion module are designed to capture and integrate the graph structures of Radar point clouds, delivering superior model efficiency and robustness without relying on intermediate depth results. Moreover, TacoDepth flexibly supports different inference modes, providing a better balance of speed and accuracy. Extensive experiments demonstrate the efficacy of our method. Compared with the previous state-of-the-art approach, TacoDepth improves depth accuracy and processing speed by 12.8% and 91.8%, respectively. Our work provides a new perspective on efficient Radar-Camera depth estimation.

Paper Structure

This paper contains 28 sections, 5 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Performance and efficiency. Circle area indicates inference time (ms); smaller circles mean faster inference. The X-axis and Y-axis show the MAE and RMSE depth errors on the nuScenes [nus] dataset. Lower MAE and RMSE mean higher accuracy. TacoDepth outperforms prior art by large margins.
  • Figure 2: Intermediate quasi-dense depth maps [Singh_2023_CVPR, icra24, rcpda, iros20eth, dornradar, lo2023rcdpt, iros24] remain sparse and noisy. Pixels with valid depth values are visualized by the gray areas in the red rectangular boxes.
  • Figure 3: Overview of TacoDepth. The graph-based Radar structure extractor captures graph structures of Radar point clouds through the node feature $N_l$ and edge feature $E_l$ in a given layer $l$. The pyramid-based Radar fusion module integrates image and Radar features in a pyramidal, hierarchical manner. In each layer, to build cross-modal correspondences efficiently, the Radar-centered flash attention is calculated within Radar-centered areas based on horizontal coordinates: for example, pixels in the orange area serve as queries, while Radar points in the red area serve as keys and values. TacoDepth achieves efficient and accurate Radar-Camera depth estimation in one stage. During inference, the model is flexible and supports both independent and plug-in processing, facilitating a better balance of model efficiency and accuracy.
  • Figure 4: Visual results of independent models on nuScenes [nus]. Both daytime and nighttime samples are presented. Prior methods [rcpda, Singh_2023_CVPR] exhibit disrupted structures and noticeable artifacts. TacoDepth produces accurate depth with finer details and more complete structures.
  • Figure 5: Visual results of plug-in models [icra24] on the ZJU-4DRadarCam [icra24] dataset. The same depth predictor, DPT-Hybrid [dpt], is adopted for fair comparison. Regions with obvious differences are highlighted in the rectangular boxes. Best viewed zoomed in on screen for details.
  • ...and 8 more figures
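
The Figure 3 caption describes attention restricted to Radar-centered areas along the horizontal axis, with pixels as queries and Radar points as keys and values. A minimal sketch of that windowing idea follows. It is an illustration under stated assumptions, not the authors' implementation: the function name `radar_centered_attention`, the `half_width` parameter, the use of raw Radar features as both keys and values (no learned projections), and the zero output for pixels with no nearby Radar point are all hypothetical. It also uses plain masked softmax attention rather than a flash-attention kernel.

```python
import math


def radar_centered_attention(pixel_feats, pixel_x, radar_feats, radar_x, half_width):
    """Cross-attention limited to a horizontal window around each Radar point.

    pixel_feats: list of query vectors, one per pixel
    pixel_x:     horizontal coordinate of each pixel
    radar_feats: list of Radar feature vectors (keys and values; an assumption)
    radar_x:     horizontal coordinate of each projected Radar point
    half_width:  hypothetical half-size of the Radar-centered window
    """
    dim = len(pixel_feats[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    fused = []
    for query, qx in zip(pixel_feats, pixel_x):
        # A pixel attends only to Radar points whose window contains it.
        idx = [j for j, rx in enumerate(radar_x) if abs(rx - qx) <= half_width]
        if not idx:
            fused.append([0.0] * dim)  # assumption: no Radar evidence -> zeros
            continue
        scores = [scale * sum(a * b for a, b in zip(query, radar_feats[j])) for j in idx]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        fused.append([
            sum(w * radar_feats[j][d] for w, j in zip(weights, idx)) / total
            for d in range(dim)
        ])
    return fused


if __name__ == "__main__":
    pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    radar = [[2.0, 0.0], [0.0, 4.0]]
    print(radar_centered_attention(pixels, [0, 5, 100], radar, [1, 6], half_width=2))
```

Restricting each query to a small set of nearby Radar points keeps the cost proportional to the number of pixel-Radar pairs inside the windows rather than to all pairs, which is the efficiency motivation the caption points to.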