
UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection

Haocheng Zhao, Runwei Guan, Taoyu Wu, Ka Lok Man, Limin Yu, Yutao Yue

TL;DR

UniBEVFusion tackles robustness and efficiency in radar-vision BEV-based 3D object detection by introducing the Radar Depth Lift-Splat-Shoot (RDL) module to inject radar-specific cues into depth prediction and the Unified Feature Fusion (UFF) framework to unify cross-modal features. The approach is evaluated on VoD and TJ4D, where it achieves state-of-the-art gains on TJ4D and strong performance in challenging driving-corridor regions, especially for occluded and short-range cases. A novel Failure Test (FT) demonstrates that UFF reduces reliance on vision and improves resilience to visual degradation, while RDL enhances depth accuracy. The work suggests practical significance for robust, cost-effective multi-modal sensing in autonomous driving, with potential for broader failure-mode testing and optimization.

Abstract

4D millimeter-wave (MMW) radar, which provides height information and denser point cloud data than 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird's-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 1.44 in 3D and 1.72 in BEV object detection accuracy.
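
The Failure Test described in the abstract degrades the vision input with Gaussian noise. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the noise is additive, is applied to a camera image normalized to [0, 1], and is swept over several standard deviations. The function names `inject_gaussian_noise` and `failure_test_inputs` and the chosen noise levels are hypothetical.

```python
import numpy as np


def inject_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of `image` (float array with values in [0, 1]).

    Assumption: additive zero-mean Gaussian noise with standard deviation
    `sigma`, clipped back to the valid [0, 1] range.
    """
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)


def failure_test_inputs(image, sigmas, seed=0):
    """Build one degraded image per noise level (hypothetical sweep)."""
    rng = np.random.default_rng(seed)
    return {sigma: inject_gaussian_noise(image, sigma, rng) for sigma in sigmas}


if __name__ == "__main__":
    # A flat gray dummy image stands in for a real camera frame.
    clean = np.full((32, 32, 3), 0.5)
    degraded = failure_test_inputs(clean, sigmas=[0.0, 0.1, 0.5])
    for sigma, img in degraded.items():
        # Mean absolute deviation from the clean image grows with sigma.
        print(sigma, float(np.abs(img - clean).mean()))
```

In an evaluation loop, each degraded image would be passed to the fusion model alongside the unmodified radar input, so that any drop in detection accuracy can be attributed to the vision branch alone.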

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Overview of the proposed UniBEVFusion network. The network consists of four main stages: Image, Radar, Fusion, and BEV. The Image and Radar stages extract BEV features from the image and radar inputs, respectively. The Fusion stage fuses the BEV features from these two stages. The BEV stage performs the final BEV feature extraction and contains the 3D object detection head.
  • Figure 2: Radar Depth Lift-Splat-Shoot (RDL) module.
  • Figure 3: Unified Feature Fusion (UFF).
  • Figure 4: Comparison of detection results between UniBEVFusion and BEVFusion [liu2023bevfusion]. 2D GT and 3D GT are the ground truth of 2D and 3D detection, respectively. BEV and BEV$_{\text{Feat}}$ are the detection results and fused BEV feature of BEVFusion, respectively. UniBEV and UniBEV$_{\text{Feat}}$ are the detection results and fused BEV feature of UniBEVFusion, respectively. Red, green, and blue boxes represent cars, pedestrians, and cyclists, respectively.