RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

Zhiwei Lin, Zhe Liu, Yongtao Wang, Le Zhang, Ce Zhu

TL;DR

The paper addresses robust radar-camera fusion for accurate 3D perception in autonomous driving. It introduces RCBEVDet, combining RadarBEVNet for radar BEV feature extraction with CAMF for deformable cross-attention-based fusion, and extends to RCBEVDet++ to support sparse fusion and query-based camera models across 3D detection, BEV segmentation, and 3D MOT. On nuScenes, RCBEVDet++ achieves state-of-the-art radar-camera multi-modal results, notably $72.73$ NDS and $67.34$ mAP for 3D object detection with ViT-L without test-time augmentation, along with strong BEV segmentation and MOT performance. The approach demonstrates robustness to sensor failure and broad generalization across backbones and detector architectures, highlighting its practical impact for cost-effective, reliable autonomous driving perception.

Abstract

Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes dataset show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
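
To put the two headline numbers in context, the sketch below shows how the nuScenes detection score (NDS) relates to mAP. It assumes the standard nuScenes definition, in which NDS averages mAP (weighted by 5) with five true-positive scores, each computed as 1 - min(1, error) for mATE, mASE, mAOE, mAVE, and mAAE. The function names are hypothetical, and the per-metric errors of RCBEVDet++ are not given in the text above, so only the implied average true-positive score is derived.

```python
def nds(mean_ap, tp_errors):
    """nuScenes detection score from mAP (as a fraction) and five mean TP errors.

    tp_errors is expected in the order mATE, mASE, mAOE, mAVE, mAAE.
    Each error is clipped at 1 and converted to a score of 1 - error.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five TP errors: mATE, mASE, mAOE, mAVE, mAAE")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


def implied_mean_tp_score(nds_value, mean_ap):
    """Average TP score implied by an NDS/mAP pair (the formula above, inverted)."""
    return 2.0 * nds_value - mean_ap


# Values reported in the abstract for RCBEVDet++ with ViT-L, as fractions.
reported_nds, reported_map = 0.7273, 0.6734
mean_tp = implied_mean_tp_score(reported_nds, reported_map)
print(f"implied mean TP score: {mean_tp:.4f}")
```

Under that definition, the reported pair implies an average true-positive score of roughly 0.78, meaning the localization, scale, orientation, velocity, and attribute terms contribute more to the NDS than the mAP term does.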

Paper Structure (38 sections, 15 equations, 9 figures, 14 tables)

This paper contains 38 sections, 15 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Overall pipeline of RCBEVDet. Firstly, multi-view images are encoded and transformed into image BEV features. Concurrently, radar point clouds are processed by the proposed RadarBEVNet to extract radar BEV features. Subsequently, features from both radar and cameras are dynamically aligned and aggregated using the Cross-Attention Multi-layer Fusion (CAMF) module. The resulting semantically rich multi-modal feature is then utilized for the 3D object detection task.
  • Figure 2: Architecture of the dual-stream radar backbone. (a) Point-based Block. (b) Transformer-based Block.
  • Figure 3: Architecture of the Injection and Extraction module. The left figure shows the details of the injection operation. The right figure displays the structure of the extraction operation.
  • Figure 4: Illustration of RCS-aware scattering. RCS-aware scattering uses the radar cross section (RCS) as an object-size prior to scatter the feature of one radar point to many BEV pixels.
  • Figure 5: Cross-attention multi-layer fusion module. The BEV features from radar and cameras are dynamically aligned using deformable cross-attention. Subsequently, the multi-modal BEV features are aggregated through a channel and spatial fusion module, which consists of several Convolution-BatchNorm-ReLU (CBR) blocks.
  • ...and 4 more figures
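
The idea behind Figure 4 can be illustrated with a minimal sketch, written in plain Python for readability. It is not the paper's implementation: the radius rule (`radius_scale * rcs`, clamped at zero), the summation of overlapping contributions, and the name `rcs_aware_scatter` are all assumptions made for illustration. The only part taken from the caption is that a point's RCS acts as a size prior deciding how many BEV pixels receive its feature.

```python
from math import hypot


def rcs_aware_scatter(points, grid_h, grid_w, radius_scale=1.0):
    """Scatter radar point features onto a BEV grid using RCS as a size prior.

    points: list of (px, py, rcs, feature), where (px, py) is the integer BEV
    pixel of the point and feature is a list of floats.
    Returns a grid_h x grid_w grid of feature vectors, indexed as bev[y][x].
    """
    channels = len(points[0][3]) if points else 0
    bev = [[[0.0] * channels for _ in range(grid_w)] for _ in range(grid_h)]
    for px, py, rcs, feat in points:
        # Hypothetical rule: a larger RCS spreads the feature over a wider disc.
        radius = radius_scale * max(rcs, 0.0)
        r = int(radius)
        for y in range(max(0, py - r), min(grid_h, py + r + 1)):
            for x in range(max(0, px - r), min(grid_w, px + r + 1)):
                if hypot(x - px, y - py) <= radius:
                    # Assumption: overlapping contributions are summed.
                    for c in range(channels):
                        bev[y][x][c] += feat[c]
    return bev


# Two points: one with RCS 1.0 (covers a small disc), one with RCS 0.0 (single pixel).
demo = rcs_aware_scatter([(2, 2, 1.0, [1.0]), (3, 2, 0.0, [10.0])], 5, 5)
```

With a plain one-point-to-one-pixel scatter, the first point would fill a single cell; here it fills its own pixel and the four direct neighbours, which is how a sparse radar point cloud yields a denser BEV feature.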