RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, Ce Zhu

TL;DR

Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on the nuScenes and View-of-Delft (VoD) 3D object detection benchmarks, and outperforms all real-time camera-only and radar-camera 3D object detectors while running faster, at 21 to 28 FPS.

Abstract

Three-dimensional object detection is one of the key tasks in autonomous driving. To reduce costs in practice, low-cost multi-view cameras have been proposed as a replacement for expensive LiDAR sensors in 3D object detection. However, it is difficult to achieve highly accurate and robust 3D object detection with cameras alone. An effective solution is to combine multi-view cameras with economical millimeter-wave radar sensors for more reliable multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a radar-camera fusion 3D object detection method in the bird's eye view (BEV). Specifically, we first design RadarBEVNet for radar BEV feature extraction. RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section (RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based encoder and a transformer-based encoder extract radar features, with an injection and extraction module to facilitate communication between the two encoders. The RCS-aware BEV encoder uses RCS as an object-size prior when scattering point features in BEV. In addition, we present the Cross-Attention Multi-layer Fusion (CAMF) module, which automatically aligns the multi-modal BEV features from radar and camera with a deformable attention mechanism and then fuses them with channel and spatial fusion layers. Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on the nuScenes and View-of-Delft (VoD) 3D object detection benchmarks. Furthermore, RCBEVDet achieves better 3D detection results than all real-time camera-only and radar-camera 3D object detectors, with a faster inference speed of 21 to 28 FPS. The source code will be released at https://github.com/VDIGPKU/RCBEVDet.

Paper Structure

This paper contains 13 sections, 10 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of the proposed RCBEVDet and other real-time 3D object detectors. Our RCBEVDet achieves state-of-the-art accuracy and accuracy-speed trade-offs. All entries are evaluated on the nuScenes val set, and the inference speed is benchmarked on a single RTX 3090 GPU.
  • Figure 2: Overall pipeline of RCBEVDet. First, multi-view images are encoded and transformed into the bird’s eye view to obtain the image BEV feature. Concurrently, radar point clouds are sent to the proposed RadarBEVNet to extract the radar BEV feature. Afterward, the BEV features from radar and cameras are dynamically aligned and aggregated by the cross-attention multi-layer fusion (CAMF) module. The fused, semantically rich multi-modal BEV feature is then used for the 3D object detection task.
  • Figure 3: Architecture of the dual-stream radar backbone.
  • Figure 4: Architecture of the Injection and Extraction module. The left figure shows the details of the injection operation. The right figure displays the structure of the extraction operation.
  • Figure 5: Illustration of RCS-aware scattering. RCS-aware scattering uses RCS as an object-size prior to scatter the feature of one radar point to many BEV pixels.
  • ...and 1 more figure
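
The idea in the Figure 5 caption, one radar point spreading its feature over several BEV pixels according to its RCS, can be illustrated with a small sketch. This is not the authors' implementation: the function name `rcs_aware_scatter`, the linear RCS-to-radius mapping (`radius_per_rcs`), the clamp `max_radius_cells`, and the averaging of overlapping contributions are all assumptions made here for illustration, since the summary above does not give the actual formula.

```python
def rcs_aware_scatter(points, grid_size, cell_m,
                      radius_per_rcs=0.5, max_radius_cells=3):
    """Scatter each radar point's feature to BEV cells within an RCS-dependent radius.

    points: list of (x_m, y_m, rcs, feature), where feature is a list of floats.
    grid_size: (rows, cols) of the BEV grid; cell_m: cell edge length in meters.
    Returns grid[row][col], holding the mean feature of all points that reach
    the cell, or None if no point reaches it.
    """
    rows, cols = grid_size
    dim = len(points[0][3]) if points else 0
    sums = [[[0.0] * dim for _ in range(cols)] for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]

    for x, y, rcs, feat in points:
        col = int(x // cell_m)
        row = int(y // cell_m)
        # Hypothetical size prior: a larger RCS gives a larger footprint,
        # floored to whole cells and clamped to max_radius_cells.
        radius = min(max_radius_cells, int(radius_per_rcs * max(rcs, 0.0)))
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                if dr * dr + dc * dc > radius * radius:
                    continue  # keep a roughly circular footprint
                r, c = row + dr, col + dc
                if 0 <= r < rows and 0 <= c < cols:
                    for k in range(dim):
                        sums[r][c][k] += feat[k]
                    counts[r][c] += 1

    # Average overlapping contributions; untouched cells stay None.
    return [
        [[s / counts[r][c] for s in sums[r][c]] if counts[r][c] else None
         for c in range(cols)]
        for r in range(rows)
    ]


# One point with RCS 2.0 gets a radius of 1 cell; a point with RCS 0.0 fills
# only its own cell.
large = rcs_aware_scatter([(2.5, 2.5, 2.0, [1.0])], (5, 5), 1.0)
small = rcs_aware_scatter([(2.5, 2.5, 0.0, [1.0])], (5, 5), 1.0)
```

A plain scatter would write each point into a single cell, which leaves the sparse radar BEV map mostly empty. Widening the footprint with RCS is one simple way to express the size prior the caption describes.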