RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale

Han Li, Yukai Ma, Yaqing Gu, Kewei Hu, Yong Liu, Xingxing Zuo

TL;DR

This work proposes a four-stage Radar-Camera framework for highly accurate, finely detailed dense depth estimation: monocular depth prediction, global scale alignment of the monocular depth with sparse Radar points, quasi-dense scale estimation by learning the association between Radar points and image patches, and local scale refinement of the dense depth with a scale map learner.
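
The global alignment stage can be pictured as a least-squares fit. This is a minimal sketch, not the authors' implementation. It assumes that the monocular network outputs relative inverse depth $\mathbf{z}$ (as MiDaS-style models do) and that alignment is an affine fit (one global scale and one shift) in inverse-depth space at the pixels hit by projected Radar points. The function name `global_align_inverse_depth` and its arguments are hypothetical.

```python
import numpy as np


def global_align_inverse_depth(z_mono, radar_uv, radar_depth, min_depth=1e-3):
    """Fit z_metric ~= s * z_mono + t using sparse Radar depths (hypothetical helper).

    z_mono:      (H, W) relative inverse depth from a monocular network.
    radar_uv:    (N, 2) integer pixel coordinates (u = column, v = row) of
                 Radar points already projected into the image.
    radar_depth: (N,) metric depth of those points, in meters.
    Returns the scale s, the shift t, and the globally aligned inverse depth map.
    """
    u, v = radar_uv[:, 0], radar_uv[:, 1]
    valid = radar_depth > min_depth          # drop points with no usable depth
    if np.count_nonzero(valid) < 2:
        raise ValueError("need at least two valid Radar points to fit scale and shift")

    x = z_mono[v[valid], u[valid]]           # monocular inverse depth at Radar pixels
    y = 1.0 / radar_depth[valid]             # Radar inverse depth (regression target)

    # Solve min_{s,t} || s * x + t - y ||^2 with ordinary least squares.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)

    z_aligned = s * z_mono + t               # applied to every pixel, not only Radar hits
    return float(s), float(t), z_aligned
```

Fitting in inverse-depth space matches the output space assumed for the monocular network, so the aligned map can be inverted to metric depth afterwards. A single global scale and shift cannot correct locally varying scale errors, which is the gap the later quasi-dense and local refinement stages address.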

Abstract

We present a novel approach for metric dense depth estimation based on the fusion of a single-view image and a sparse, noisy Radar point cloud. Directly fusing heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocular depth prediction with a dense metric scale induced from the sparse and noisy Radar data. We propose a Radar-Camera framework for highly accurate, finely detailed dense depth estimation with four stages: monocular depth prediction, global scale alignment of monocular depth with sparse Radar points, quasi-dense scale estimation through learning the association between Radar points and image patches, and local scale refinement of dense depth using a scale map learner. Our method significantly outperforms state-of-the-art Radar-Camera depth estimation methods, reducing the mean absolute error (MAE) of depth estimation by 25.6% and 40.2% on the challenging nuScenes dataset and our self-collected ZJU-4DRadarCam dataset, respectively. Our code and dataset will be released at https://github.com/MMOCKING/RadarCam-Depth.
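
For reference, MAE is averaged over the pixels that have ground-truth depth, and a relative reduction compares two MAE values. The sketch below assumes that ground truth is a depth map in meters with zeros marking missing pixels, plus an optional maximum-depth cap. The exact evaluation protocol behind the reported 25.6% and 40.2% figures (valid-pixel mask, depth range) is not stated in this summary, and the helper names `masked_mae` and `relative_reduction` are hypothetical.

```python
import numpy as np


def masked_mae(pred_depth, gt_depth, max_depth=None):
    """Mean absolute error (meters) over pixels with valid ground truth.

    Pixels with gt_depth <= 0 are treated as missing; an optional max_depth
    cap (an assumed protocol detail) excludes far ground-truth pixels.
    """
    valid = gt_depth > 0
    if max_depth is not None:
        valid &= gt_depth <= max_depth
    if not np.any(valid):
        raise ValueError("no valid ground-truth pixels")
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid])))


def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae is lower than baseline_mae."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae
```

For example, a baseline MAE of 2.0 m reduced to 1.5 m corresponds to a 25% reduction.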

Paper Structure (25 sections, 4 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Top: 3D visualization of the metric depth estimation from our proposed RadarCam-Depth; Middle: our metric depth estimation overlaid on the corresponding error map; Bottom: error map of the mono-depth after scale alignment to Radar points. Our depth estimation exhibits exceptional metric accuracy and fine details.
  • Figure 2: The overall framework of our proposed RadarCam-Depth, comprising four stages: monocular depth prediction, global alignment of mono-depth with sparse Radar depth, learned quasi-dense scale estimation, and a scale map learner for refining local scale. $\mathbf{d}$ and $\mathbf{s}$ denote the depth and scale, while $\mathbf{z}=1/\mathbf{d}$ is the inverse depth.
  • Figure 3: Left: the input image. Middle: Mono-Pred of MiDaS v3.1 birkl2023midas. Right: Mono-Pred of DPT-Hybrid ranftl2021vision. Notably, MiDaS is able to distinguish the sky.
  • Figure 4: (a) Top: nuScenes dataset caesar2020nuscenes with LiDAR depth $\mathbf{d}_{gt}$, accumulated LiDAR depth $\mathbf{d}_{acc}$, depth from the 3D Radar point cloud $\mathbf{P}$, and interpolated LiDAR depth $\mathbf{d}_{int}$, shown clockwise. The misalignment between LiDAR points and image pixels in this dataset is highlighted with red boxes. Depth from the 3D Radar point cloud is very sparse and non-uniformly distributed. (b) Our ZJU-4DRadarCam dataset with LiDAR depth $\mathbf{d}_{gt}$, interpolated LiDAR depth $\mathbf{d}_{int}$, and depth from the 4D Radar point cloud $\mathbf{P}$, shown from top to bottom. Compared to nuScenes, the ZJU-4DRadarCam dataset offers more accurate and denser LiDAR depth and denser 4D Radar depth.
  • Figure 5: (a) Our metric depth estimation over the input image in a large-scale scenario. (b) The top row shows the ground-truth depth $\mathbf{d}_{int}$ and the Radar points $\mathbf{P}$ projected into image $\mathbf{I}$. The remaining rows, from top to bottom, depict the depth estimations of lo2021depth, singh2023depth, and our RadarCam-Depth, with the corresponding error maps. Our method demonstrates much higher accuracy and finer details.
  • ...and 2 more figures
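
Using the notation of Figure 2 ($\mathbf{z}=1/\mathbf{d}$, scale $\mathbf{s}$), the final refinement step can be sketched as applying a dense per-pixel scale map to the globally aligned inverse depth and converting back to metric depth. This is an illustration under an assumption: it treats the scale as acting multiplicatively on inverse depth. How the learned scale map is actually parameterized in RadarCam-Depth is not specified in this summary, and `apply_scale_map` is a hypothetical helper.

```python
import numpy as np


def apply_scale_map(z_global, scale_map, eps=1e-6):
    """Refine globally aligned inverse depth with a dense scale map (hypothetical).

    Assumed formulation: z_refined = scale_map * z_global (elementwise),
    followed by the conversion d = 1 / z_refined to metric depth.
    """
    if z_global.shape != scale_map.shape:
        raise ValueError("scale map must match the inverse-depth map shape")
    z_refined = scale_map * z_global
    # Clamp non-positive inverse depth to eps so the inversion stays finite.
    z_refined = np.clip(z_refined, eps, None)
    return 1.0 / z_refined
```

Working in inverse depth means a per-pixel scale of 0.5 doubles the metric depth at that pixel, and the clamp keeps pixels with zero or negative inverse depth (for example, sky after a poor global fit) at a large but finite depth instead of producing infinities.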