R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia; Yousen Tang; Yongtao Wang; Zhifeng Wang; Weijun Qin

Abstract

The 4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data face several challenges. First, their absolute depth estimation modules are neither robust nor accurate enough, which leads to inaccurate 3D localization. Second, their temporal fusion modules degrade dramatically, or even fail, when the ego vehicle's pose is missing or inaccurate. Third, some small objects may return no radar points at all because the radar point cloud is sparse; in such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which improves depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we build an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.
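The abstract does not spell out how absolute and relative depth reinforce each other, so the following is only a minimal sketch of one widely used generic idea, not the Panoramic Depth Fusion module itself. It assumes that relative depth is related to metric depth by an unknown scale and shift, and that a few pixels carry an absolute depth value (for example from radar returns). The function names `fit_scale_shift` and `to_metric` and all sample values are hypothetical.

```python
def fit_scale_shift(relative, absolute):
    """Closed-form least-squares fit of absolute ~ scale * relative + shift."""
    n = len(relative)
    if n < 2 or n != len(absolute):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_a = sum(absolute) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (a - mean_a) for r, a in zip(relative, absolute))
    scale = cov / var_r
    shift = mean_a - scale * mean_r
    return scale, shift


def to_metric(relative_map, scale, shift):
    """Apply the fitted affine map to every pixel of a dense relative-depth map."""
    return [[scale * d + shift for d in row] for row in relative_map]


# Hypothetical data: relative depth at three pixels that also have an
# absolute depth measurement in metres (assumed, not from the paper).
rel_at_radar = [0.1, 0.5, 0.9]
radar_depth_m = [6.0, 22.0, 38.0]

scale, shift = fit_scale_shift(rel_at_radar, radar_depth_m)
dense = to_metric([[0.2, 0.4], [0.6, 0.8]], scale, shift)
```

In this toy setting the sparse absolute measurements fix the global scale of the dense relative map, while the relative map fills in depth where no absolute measurement exists. A learned module would presumably go well beyond a single global affine fit.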

Paper Structure (18 sections, 20 equations, 8 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Camera-only 3D Object Detection
4D Radar-Camera Fusion for 3D Object Detection
Method
Overall Framework
Panoramic Depth Fusion (PDF)
Deformable Gated Temporal Fusion (DGTF)
Instance-Guided Dynamic Refinement (IGDR)
Experiments
Datasets and Metrics
Implementation Details
Main Results
Ablation
Conclusion
...and 3 more sections

Figures (8)

Figure 1: Comparison of R4Det with current 4D radar-camera real-time detectors.
Figure 2: Overall architecture of R4Det. Our framework progressively purifies the BEV representation in three stages: i) The Panoramic Depth Fusion (PDF) module generates a geometrically accurate BEV feature map from multi-modal inputs. ii) The Deformable Gated Temporal Fusion (DGTF) module performs pose-free alignment and integration to create a temporally consistent feature. iii) The Instance-Guided Dynamic Refinement (IGDR) module leverages 2D instance prototypes to purify the final features for 3D detection.
Figure 3: Overview of the Panoramic Depth Fusion (PDF) module.
Figure 4: Architecture of the proposed Deformable Gated Temporal Fusion (DGTF) module. DGTF consists of two specialized branches: motion-aware alignment using deformable convolution and a gated temporal update mechanism.
Figure 5: Overview of the Instance-Guided Dynamic Refinement (IGDR) module. IGDR adaptively refines radar-camera BEV features by suppressing instance overlap contamination and cross-modality noise, while preserving reliable distant object representations.
...and 3 more figures
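
The Figure 4 caption mentions a gated temporal update mechanism. As a rough illustration of what a gated update of this kind can look like in general, the sketch below blends a current BEV feature grid with a previous one, cell by cell, using a sigmoid gate. It is not the DGTF module: the deformable-convolution alignment is omitted, the previous grid is assumed to be already aligned, and the function `gated_update` and its scalar weights `w_cur`, `w_prev`, and `bias` are hypothetical stand-ins for learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(current, previous, w_cur=1.0, w_prev=-1.0, bias=0.0):
    """Blend two equally sized 2D grids cell by cell.

    gate = sigmoid(w_cur * c + w_prev * p + bias)
    fused = gate * c + (1 - gate) * p
    The weights are illustrative constants; a real model would learn them.
    """
    fused = []
    for cur_row, prev_row in zip(current, previous):
        row = []
        for c, p in zip(cur_row, prev_row):
            gate = sigmoid(w_cur * c + w_prev * p + bias)
            row.append(gate * c + (1.0 - gate) * p)
        fused.append(row)
    return fused


# Toy single-channel BEV grids (hypothetical values).
current = [[1.0, 0.0]]
previous = [[1.0, 2.0]]
fused = gated_update(current, previous)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the current and previous values, so the update can lean on history where the current frame is weak without relying on ego-pose information.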
