FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection

Guoxin Zhang, Ziying Song, Lin Liu, Zhonghong Ou

TL;DR

The paper tackles dimension mismatch in multimodal 3D object detection by introducing FGU3R, which builds a unified 3D representation from depth-completed pseudo points and employs PRConv for fine-grained feature extraction. It then uses Cross-Attention Adaptive Fusion (CAAF) to adaptively fuse RoI features from raw and pseudo streams within a two-stage detector optimized with depth supervision. Ablation studies and experiments on KITTI and nuScenes show performance gains over state-of-the-art methods, including improvements in $AP_{3D}$ and NDS for challenging categories. This approach advances practical multimodal fusion for autonomous driving by aligning modalities in 3D space and enabling selective, fine-grained fusion at the RoI level.
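
As a rough illustration of how pseudo points can be obtained from a completed depth map, the sketch below back-projects each pixel with a valid depth into the camera frame using a standard pinhole model. The function name `depth_to_pseudo_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example; the summary above does not specify the paper's depth-completion network or projection details.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_pseudo_points(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a dense depth map into 3D points (camera frame).

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy). Pixels with non-positive
    depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):          # v: pixel row index
        for u, z in enumerate(row):          # u: pixel column index
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with two valid pixels (values in metres, made up).
depth = [[0.0, 10.0],
         [20.0, 0.0]]
pts = depth_to_pseudo_points(depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
print(pts)
```

Once image pixels are lifted into 3D in this way, raw LiDAR points and pseudo points share one coordinate space, which is the precondition for the unified 3D representation described above.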

Abstract

Multimodal 3D object detection has garnered considerable interest in autonomous driving. However, multimodal detectors suffer from dimension mismatches that arise from coarsely fusing 3D points with 2D pixels, which leads to sub-optimal fusion performance. In this paper, we propose FGU3R, a multimodal framework that tackles this issue via a unified 3D representation and fine-grained fusion. It consists of two main components. First, we propose an efficient feature extractor for raw and pseudo points, termed Pseudo-Raw Convolution (PRConv), which modulates multimodal features synchronously and aggregates the features from different types of points on key points based on multimodal interaction. Second, Cross-Attention Adaptive Fusion (CAAF) is designed to adaptively fuse homogeneous 3D RoI (Region of Interest) features in a fine-grained manner via a cross-attention variant. Together, they enable fine-grained fusion on a unified 3D representation. Experiments on the KITTI and nuScenes datasets show the effectiveness of the proposed method.
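
To make the cross-attention idea concrete, here is a minimal sketch of fusing two sets of RoI features with plain scaled dot-product attention: raw-stream features act as queries, pseudo-stream features act as keys and values, and the attended result is added back to the raw features. This is generic cross-attention under simplifying assumptions (no learned projections, a single head, a residual sum), not the paper's CAAF variant, whose exact design is not given in the abstract. The names `cross_attention_fuse`, `raw_feats`, and `pseudo_feats` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(
    raw_feats: List[Vector], pseudo_feats: List[Vector]
) -> List[Vector]:
    """Fuse RoI features from two streams with cross-attention.

    Assumption: queries come from the raw stream and keys/values from
    the pseudo stream, with identity projections for simplicity.
    """
    d = len(raw_feats[0])
    fused: List[Vector] = []
    for q in raw_feats:
        # Similarity of this raw feature to every pseudo feature.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in pseudo_feats
        ]
        weights = softmax(scores)
        # Attention-weighted average of the pseudo features.
        attended = [
            sum(w * v[i] for w, v in zip(weights, pseudo_feats))
            for i in range(d)
        ]
        # Residual connection keeps the original raw-stream signal.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy RoI grid features (values made up for illustration).
raw = [[1.0, 0.0], [0.0, 1.0]]
pseudo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention_fuse(raw, pseudo)
print(fused)
```

Because both inputs are already 3D RoI features of the same dimensionality, the attention weights can select, per raw feature, which pseudo features contribute most, which is the kind of selective fusion the abstract describes.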

Paper Structure (11 sections, 7 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: (a) Due to the discrepancy between 3D points and 2D images, dimension-mismatched features are hard to fuse and align efficiently, resulting in sub-optimal integration performance. (b) The unified 3D representation we employ enables easy fine-grained fusion while maintaining semantic adjacency.
  • Figure 2: The overall architecture of FGU3R. The dashed line means inference-only. RPN, BEV, and A.S. denote Region Proposal Network, Bird's-Eye View, and Auxiliary Supervision, respectively.
  • Figure 3: The architecture of CAAF.