UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection
Haocheng Zhao, Runwei Guan, Taoyu Wu, Ka Lok Man, Limin Yu, Yutao Yue
TL;DR
UniBEVFusion tackles robustness and efficiency in radar-vision BEV-based 3D object detection by introducing the Radar Depth Lift-Splat-Shoot (RDL) module to inject radar-specific cues into depth prediction and the Unified Feature Fusion (UFF) framework to unify cross-modal features. The approach is evaluated on VoD and TJ4D, where it achieves state-of-the-art gains on TJ4D and strong performance in challenging driving-corridor regions, especially for occluded and short-range cases. A novel Failure Test (FT) demonstrates that UFF reduces reliance on vision and improves resilience to visual degradation, while RDL enhances depth accuracy. The work suggests practical significance for robust, cost-effective multi-modal sensing in autonomous driving, with potential for broader failure-mode testing and optimization.
Abstract
4D millimeter-wave (MMW) radar, which provides height information and denser point clouds than 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, with the advantages of lower hardware cost and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird's-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 1.44 in 3D and 1.72 in BEV object detection accuracy.
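As an illustration of the Failure Test idea, the sketch below shows one way vision degradation could be simulated by adding Gaussian noise to a camera image. It is not the authors' implementation: the abstract states only that Gaussian noise is injected, so the function names `inject_gaussian_noise` and `failure_test_inputs`, the choice of additive pixel-space noise, the `[0, 1]` value range, the clipping step, and the noise levels are all assumptions made for this example.

```python
import numpy as np


def inject_gaussian_noise(image, sigma, rng=None):
    """Return a copy of `image` with additive zero-mean Gaussian noise.

    Assumption: `image` is a float array scaled to [0, 1]; the result is
    clipped back to that range. `sigma` is the noise standard deviation.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def failure_test_inputs(image, sigmas, seed=0):
    """Build one degraded copy of `image` per noise level in `sigmas`.

    Hypothetical helper: returns a dict mapping each sigma to its noisy
    image, so a detector can be evaluated at increasing degradation.
    """
    rng = np.random.default_rng(seed)
    return {sigma: inject_gaussian_noise(image, sigma, rng) for sigma in sigmas}


if __name__ == "__main__":
    # Stand-in for a camera frame: a uniform gray H x W x 3 image.
    frame = np.full((4, 6, 3), 0.5)
    degraded = failure_test_inputs(frame, sigmas=[0.0, 0.1, 0.5])
    for sigma, img in degraded.items():
        # Mean absolute deviation from the clean frame grows with sigma.
        print(sigma, float(np.abs(img - frame).mean()))
```

Under these assumptions, a robustness study would feed each degraded image, together with the unmodified radar input, to the fusion model and compare detection accuracy across noise levels.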
