Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection

Xiaokai Bai, Lianqing Zheng, Si-Yuan Cao, Xiaohan Zhang, Zhe Wu, Beinan Yu, Fang Wang, Jie Bai, Hui-Liang Shen

TL;DR

SIFormer is a scene-instance aware transformer for 3D object detection using 4D radar and camera. It achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet and NuScenes datasets.

Abstract

4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. By enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address the inherently sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet and NuScenes datasets. Source code is available at github.com/shawnnnkb/SIFormer.
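
The background suppression step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it only assumes that each image pixel carries a depth distribution over a few bins and a foreground probability from a segmentation head, and that background pixels are zeroed out before lifting to BEV. The function name `gated_lift_weights`, the `fg_threshold` parameter and the hard threshold are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_lift_weights(depth_logits, fg_prob, fg_threshold=0.5):
    """Per-pixel lifting weights over depth bins, gated by foreground probability.

    depth_logits: one list of depth-bin logits per pixel.
    fg_prob: one foreground probability in [0, 1] per pixel.
    Returns one weight list per pixel; pixels below fg_threshold get all zeros.
    (Hypothetical helper, not taken from the SIFormer code base.)
    """
    weights = []
    for logits, p in zip(depth_logits, fg_prob):
        depth_dist = softmax(logits)
        # Hard gate: treat low-confidence pixels as background.
        gate = p if p >= fg_threshold else 0.0
        weights.append([gate * d for d in depth_dist])
    return weights


# Two pixels, three depth bins: the first is foreground, the second background.
depth_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
fg_prob = [0.9, 0.1]
weights = gated_lift_weights(depth_logits, fg_prob)
```

In this sketch the foreground pixel keeps a total weight equal to its foreground probability, spread over depth bins, while the background pixel contributes nothing to the BEV feature.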

Paper Structure (17 sections, 7 equations, 10 figures, 13 tables)

This paper contains 17 sections, 7 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Comparison of radar and camera fusion pipelines. Previous radar-camera fusion models typically adopt either (a) BEV-level or (b) perspective-level fusion. Our SIFormer, with the aim of enhancing instance awareness, bridges the gap between the two paradigms while combining their complementary strengths, as shown in (c).
  • Figure 2: Visual comparison of LiDAR and 4D radar on the VoD dataset: the first row shows LiDAR, and the second row shows 4D radar. The first column displays 3D ground truth boxes and the point cloud projection onto the foreground mask, with purple points indicating valid points providing object information. The second and third columns show the point cloud projection onto the perspective view and bird's-eye view, respectively. Dense LiDAR provides strong geometry, while sparse 4D radar only provides weak geometry.
  • Figure 3: Comparison between IS-Fusion and SIFormer. (a) IS-Fusion mines instance features directly from scene features. (b) SIFormer employs cross-view correlation to improve radar-camera fusion, addressing weak radar geometry by activating instance awareness using 2D instance features.
  • Figure 4: Architecture of our SIFormer. (a) The feature extractor extracts 4D radar and image features from raw data. (b) The instance initialization stage filters out irrelevant features during view transformation via segmentation- and depth-guided localization to focus on regions of interest while achieving global scene understanding. (c) The instance awareness enhancement stage leverages cross-view correlation (CVC) to bridge perspective-view instance features with bird's-eye-view scene features, followed by the instance enhance attention (IEA) module for further refinement, producing fused features across scene and instance levels. (d) The decoder head performs 3D object detection.
  • Figure 5: Detailed illustration of our instance initialization within scene stage. We employ sparse scene integration (SSI) to update depth and context, which are then fed into hybrid view transformation to produce the image BEV feature.
  • ...and 5 more figures
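
The idea behind cross-view correlation in Figure 4(c), where 2D instance features activate BEV scene features, can be sketched with plain scaled dot-product cross-attention. This is a generic stand-in, not the paper's CVC module: the function name `cross_view_activate`, the residual update and the absence of learned projections are all assumptions made to keep the example small.

```python
import math


def dot(a, b):
    """Dot product of two equal-length float lists."""
    return sum(x * y for x, y in zip(a, b))


def cross_view_activate(bev_feats, inst_feats):
    """Inject 2D instance features into BEV cells via cross-attention.

    bev_feats: list of BEV cell feature vectors (queries).
    inst_feats: list of 2D instance feature vectors (keys and values).
    Returns BEV features with an attention-weighted instance context added.
    (Hypothetical helper; no learned projections are applied.)
    """
    scale = 1.0 / math.sqrt(len(inst_feats[0]))
    activated = []
    for q in bev_feats:
        scores = [dot(q, k) * scale for k in inst_feats]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        # Weighted sum of instance features, one value per channel.
        ctx = [sum(a * k[i] for a, k in zip(attn, inst_feats))
               for i in range(len(q))]
        # Residual update keeps the original scene feature.
        activated.append([qi + ci for qi, ci in zip(q, ctx)])
    return activated


# One BEV cell aligned with the first instance, one empty BEV cell.
bev_feats = [[1.0, 0.0], [0.0, 0.0]]
inst_feats = [[4.0, 0.0], [0.0, 4.0]]
activated = cross_view_activate(bev_feats, inst_feats)
```

Here the BEV cell that already correlates with the first instance is pulled toward it, while the empty cell receives an equal mix of both instances.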