RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network
Zhiwei Lin, Zhe Liu, Yongtao Wang, Le Zhang, Ce Zhu
TL;DR
The paper addresses robust radar-camera fusion for accurate 3D perception in autonomous driving. It introduces RCBEVDet, which combines RadarBEVNet for radar BEV feature extraction with a Cross-Attention Multi-layer Fusion (CAMF) module that fuses radar and camera features using deformable cross-attention. It then extends this to RCBEVDet++, which adds sparse fusion and support for query-based camera models and covers 3D detection, BEV segmentation, and 3D multi-object tracking (MOT). On nuScenes, RCBEVDet++ achieves state-of-the-art radar-camera multi-modal results, notably 72.73 NDS and 67.34 mAP for 3D object detection with ViT-L without test-time augmentation, along with strong BEV segmentation and MOT performance. The approach is robust to sensor failure and generalizes across backbones and detector architectures, which makes it practical for cost-effective, reliable autonomous driving perception.
Abstract
Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors make it challenging to fuse their information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is built on an existing camera-based 3D object detector and adds a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. First, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section (RCS) aware BEV encoder. Second, the CAMF module uses a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which extends CAMF with sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes dataset show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
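To make the idea of radar and camera features sharing a BEV grid concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes a hypothetical radar point layout of (x, y, rcs), scatters points into a 2D grid that stores a point count and mean RCS per cell, and then fuses with a camera BEV grid by plain per-cell channel concatenation. That concatenation is a deliberately simple stand-in for the learned deformable cross-attention alignment and channel/spatial fusion layers that CAMF uses. All function names and parameters here are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical radar point layout (an assumption): (x, y, rcs).
Point = Tuple[float, float, float]
BevGrid = List[List[List[float]]]  # indexed as [row][col][channel]


def scatter_to_bev(points: Sequence[Point],
                   x_range: Tuple[float, float],
                   y_range: Tuple[float, float],
                   cell: float) -> BevGrid:
    """Scatter radar points into a BEV grid with channels [count, mean_rcs]."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    sums = [[0.0] * nx for _ in range(ny)]
    counts = [[0] * nx for _ in range(ny)]
    for x, y, rcs in points:
        # Drop points that fall outside the BEV extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        col = min(int((x - x_range[0]) / cell), nx - 1)
        row = min(int((y - y_range[0]) / cell), ny - 1)
        sums[row][col] += rcs
        counts[row][col] += 1
    return [
        [
            [float(counts[r][c]),
             sums[r][c] / counts[r][c] if counts[r][c] else 0.0]
            for c in range(nx)
        ]
        for r in range(ny)
    ]


def fuse_by_concat(radar_bev: BevGrid, camera_bev: BevGrid) -> BevGrid:
    """Concatenate radar and camera channels cell by cell.

    A simple stand-in for learned fusion; it assumes both grids are
    already spatially aligned and have the same height and width.
    """
    if len(radar_bev) != len(camera_bev) or any(
        len(rr) != len(cr) for rr, cr in zip(radar_bev, camera_bev)
    ):
        raise ValueError("radar and camera BEV grids must have the same shape")
    return [
        [r_cell + c_cell for r_cell, c_cell in zip(r_row, c_row)]
        for r_row, c_row in zip(radar_bev, camera_bev)
    ]
```

Most cells stay empty in a grid built this way, which illustrates why sparse radar returns call for a dedicated encoder to produce a dense BEV feature, and why an alignment step is useful before fusion when the two modalities do not line up exactly.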
