RCM-Fusion: Radar-Camera Multi-Level Fusion for 3D Object Detection

Jisong Kim, Minjae Seong, Geonho Bang, Dongsuk Kum, Jun Won Choi

TL;DR

RCM-Fusion tackles 3D object detection with low-cost radar and camera sensors by fusing the two modalities at two levels: a feature-level BEV transformation guided by radar, and an instance-level refinement that uses adaptive grid points. At the feature level, a Radar Guided BEV Encoder uses a Radar-Guided BEV Query (RGBQ) and a Radar-Camera Gating (RCG) mechanism to produce dense BEV features for initial proposals. At the instance level, Proposal-aware Radar Attention (PRA) and adaptive Grid Point Pooling (RGPP) refine those proposals. Ablation studies on nuScenes show that RGBQ contributes the largest gain, with incremental gains from RCG, RGPP, and PRA. Overall, the method achieves state-of-the-art results among single-frame radar-camera fusion methods, with improvements in both mAP and NDS. The approach improves localization precision and is presented as a promising basis for future multi-frame fusion; the authors state that code will be made publicly available.

Abstract

While LiDAR sensors have been successfully applied to 3D object detection, the affordability of radar and camera sensors has led to a growing interest in fusing radars and cameras for 3D object detection. However, previous radar-camera fusion models were unable to fully utilize the potential of radar information. In this paper, we propose Radar-Camera Multi-level fusion (RCM-Fusion), which fuses the two modalities at both the feature and instance levels. For feature-level fusion, we propose a Radar Guided Bird's-Eye-View (BEV) Encoder, which transforms camera features into precise BEV representations using the guidance of radar BEV features and then combines the radar and camera BEV features. For instance-level fusion, we propose a Radar Grid Point Refinement module that reduces localization error by accounting for the characteristics of radar point clouds. Experiments on the public nuScenes dataset demonstrate that RCM-Fusion achieves state-of-the-art performance among single-frame radar-camera fusion methods on the nuScenes 3D object detection benchmark. Code will be made publicly available.
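
The feature-level step above, which combines radar and camera BEV features, can be illustrated with a generic gated fusion. The following is a minimal NumPy sketch, not the authors' implementation: the sigmoid gates, the linear projections `w_cam` and `w_rad`, and the function name `gated_bev_fusion` are all assumptions chosen for illustration, since this summary does not specify how the Radar-Camera Gating is computed.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_bev_fusion(cam_bev, rad_bev, w_cam, w_rad):
    """Fuse two BEV feature maps with learned per-cell, per-channel gates.

    Hypothetical formulation (an assumption, not taken from the paper):
      cam_bev, rad_bev : (H, W, C) camera and radar BEV feature maps
      w_cam, w_rad     : (2C, C) linear projections that produce the gates
    """
    # Each gate looks at both modalities before weighting one of them.
    concat = np.concatenate([cam_bev, rad_bev], axis=-1)  # (H, W, 2C)
    gate_cam = sigmoid(concat @ w_cam)                    # (H, W, C), in (0, 1)
    gate_rad = sigmoid(concat @ w_rad)                    # (H, W, C), in (0, 1)
    # Weighted sum of the two modalities.
    return gate_cam * cam_bev + gate_rad * rad_bev


# Usage with random features and random (untrained) gate weights.
rng = np.random.default_rng(0)
cam = rng.normal(size=(4, 4, 8))
rad = rng.normal(size=(4, 4, 8))
fused = gated_bev_fusion(cam, rad,
                         rng.normal(size=(16, 8)),
                         rng.normal(size=(16, 8)))
print(fused.shape)
```

With all-zero gate weights both gates equal 0.5 everywhere, so the sketch reduces to a plain average of the two feature maps; trained weights would let each BEV cell lean toward whichever modality is more informative there.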

Paper Structure

This paper contains 15 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Radar-camera fusion pipeline: Previous radar-camera fusion methods fuse at a single level, as in (a) or (b), and are inherently limited by that level-specific design. The proposed multi-level fusion method (c) integrates multimodal information at multiple levels to overcome the limitations of single-level fusion.
  • Figure 2: Overall architecture of the proposed RCM-Fusion: RCM-Fusion uses a separate backbone network for each modality to obtain the radar features and camera features. The multi-modal features are then fused in the Radar Guided BEV Encoder module for feature-level fusion. For instance-level fusion, the Radar Grid Point Refinement module refines the initial results with a novel grid feature pooling method.
  • Figure 3: Visualization of the BEV feature map: Left: BEV feature map from BEVFormer-S, a single-frame version of BEVFormer. Right: BEV feature map obtained from the Radar Guided BEV Encoder. Compared to the left figure, the right figure shows that the positional information provided by the radar BEV feature map allows the features to better localize the regions where objects exist.
  • Figure 4: Illustration of novel adaptive grid point sampling: For each radar point associated with the proposal, $T$ grid points are generated in a tangential direction.
  • Figure 5: Qualitative results: Predictions of RCM-Fusion (green boxes) and the camera-only baseline model, BEVFormer-S (blue boxes), are shown alongside the ground truth (red boxes) in a bird's-eye view. RCM-Fusion produces more accurate results (orange regions) and detects objects that the camera-only baseline fails to identify (yellow regions).
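
The grid point sampling described in the Figure 4 caption can be sketched as follows. This is a hedged illustration rather than the paper's method: it assumes "tangential" means perpendicular, in the BEV plane, to the ray from the sensor origin to the radar point, and it uses a fixed `spacing` between grid points. The paper's version is adaptive, but the caption does not say how the spacing is chosen, so `spacing`, `num_points` (standing in for $T$), and the function name are hypothetical.

```python
import numpy as np


def tangential_grid_points(radar_xy, num_points, spacing):
    """Generate grid points along the tangential direction of each radar point.

    Assumptions (not from the paper): the sensor sits at the BEV origin,
    and grid points are evenly spaced and centered on the radar point.

      radar_xy   : (N, 2) radar point positions in BEV coordinates
      num_points : number of grid points per radar point (T in the caption)
      spacing    : distance between neighboring grid points
    Returns an array of shape (N, num_points, 2).
    """
    radar_xy = np.asarray(radar_xy, dtype=float)
    # Unit vector from the origin to each radar point (radial direction).
    ranges = np.linalg.norm(radar_xy, axis=1, keepdims=True)
    radial = radar_xy / np.maximum(ranges, 1e-9)
    # Rotate the radial direction by 90 degrees to get the tangent.
    tangent = np.stack([-radial[:, 1], radial[:, 0]], axis=1)
    # Symmetric offsets, e.g. [-1, 0, 1] * spacing for num_points = 3.
    offsets = (np.arange(num_points) - (num_points - 1) / 2.0) * spacing
    return radar_xy[:, None, :] + offsets[None, :, None] * tangent[:, None, :]


# Usage: one radar point 10 m straight ahead along the x-axis.
grid = tangential_grid_points([[10.0, 0.0]], num_points=3, spacing=1.0)
print(grid[0])  # three points at x = 10 with y = -1, 0, 1
```

Spreading samples tangentially rather than radially reflects a general property of automotive radar, whose range measurements are typically more precise than its azimuth measurements, so positional uncertainty is largest perpendicular to the line of sight.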