R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin

Abstract

The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data face several challenges. First, their absolute depth estimation modules are neither robust nor accurate enough, which leads to inaccurate 3D localization. Second, the performance of their temporal fusion modules degrades dramatically, or fails outright, when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point cloud may contain no returns from their surfaces at all, so detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which improves depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.

Paper Structure (18 sections, 20 equations, 8 figures, 9 tables, 1 algorithm)

Figures (8)

  • Figure 1: Comparison of R4Det with current 4D radar-camera real-time detectors.
  • Figure 2: Overall architecture of R4Det. Our framework progressively purifies the BEV representation in three stages: i) The Panoramic Depth Fusion (PDF) module generates a geometrically accurate BEV feature map from multi-modal inputs. ii) The Deformable Gated Temporal Fusion (DGTF) module performs pose-free alignment and integration to create a temporally consistent feature. iii) The Instance-Guided Dynamic Refinement (IGDR) module leverages 2D instance prototypes to purify the final features for 3D detection.
  • Figure 3: Overview of the Panoramic Depth Fusion (PDF) module.
  • Figure 4: Architecture of the proposed Deformable Gated Temporal Fusion (DGTF) module. DGTF consists of two specialized branches: motion-aware alignment using deformable convolution and a gated temporal update mechanism.
  • Figure 5: Overview of the Instance-Guided Dynamic Refinement (IGDR) module. IGDR adaptively refines radar-camera BEV features by suppressing instance overlap contamination and cross-modality noise, while preserving reliable distant object representations.
  • ...and 3 more figures
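
The Figure 4 caption mentions a gated temporal update mechanism. The general idea of such a gate can be illustrated with a minimal sketch. This is not the paper's DGTF implementation: the function name `gated_temporal_update`, the scalar gate parameters `w_cur`, `w_prev`, and `bias`, and the per-cell sigmoid gate are all assumptions chosen for illustration. The sketch also assumes the previous-frame features have already been aligned to the current frame, so the deformable-convolution alignment branch is omitted entirely.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_temporal_update(current, previous, w_cur=1.0, w_prev=-1.0, bias=0.0):
    """Blend two same-sized 2D feature grids cell by cell with a sigmoid gate.

    Hypothetical sketch: the gate here is a scalar linear function of the two
    cell values. A learned module would typically predict the gate with
    convolutions over multi-channel BEV features instead.
    """
    fused = []
    for row_cur, row_prev in zip(current, previous):
        fused_row = []
        for c, p in zip(row_cur, row_prev):
            # Gate near 1 keeps the current feature, near 0 keeps the history.
            gate = sigmoid(w_cur * c + w_prev * p + bias)
            fused_row.append(gate * c + (1.0 - gate) * p)
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    current_bev = [[1.0, 0.0]]   # toy single-channel BEV grid, current frame
    previous_bev = [[0.0, 2.0]]  # toy grid from the previous frame (assumed aligned)
    print(gated_temporal_update(current_bev, previous_bev))
```

Because the output of each cell is a convex combination of the current and previous values, it always lies between them. This means the gate can only interpolate between the two frames, never amplify either one.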