Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, Si Liu

TL;DR

ECFusion (Eliminating Conflicts Fusion) explicitly eliminates extrinsic and inherent cross-modal conflicts in BEV space to produce improved multi-modal BEV features, and achieves state-of-the-art performance on the highly competitive nuScenes 3D object detection dataset.

Abstract

Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fusion BEV features free from cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflicts stemming from heterogeneous sensor signals. Therefore, we propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space and produce improved multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based Alignment (SFA) module to resolve extrinsic conflicts via unifying spatial distribution in BEV space before fusion. Moreover, we design a Dissolved Query Recovering (DQR) mechanism to remedy inherent conflicts by preserving objectness clues that are lost in the fusion BEV feature. In general, our method maximizes the effective information utilization of each modality and leverages inter-modal complementarity. Our method achieves state-of-the-art performance in the highly competitive nuScenes 3D object detection dataset. The code is released at https://github.com/fjhzhixi/ECFusion.
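
The abstract describes SFA as a flow-based alignment that unifies the spatial distribution of the two BEV features before fusion. The following is a minimal sketch of the generic flow-warping idea only, not the authors' implementation: each output cell reads the input feature at a position shifted by a per-cell offset, using bilinear interpolation. The function name `warp_bev`, the single-channel grid, and the zero padding outside the grid are assumptions made for illustration; how the flow field is predicted (the semantic guidance) is not covered here.

```python
import math


def warp_bev(feature, flow):
    """Resample a single-channel BEV grid with a per-cell flow field.

    feature: H x W list of floats.
    flow:    H x W list of (dx, dy) offsets in cells. Output cell (y, x)
             reads the input at (y + dy, x + dx) via bilinear interpolation.
    Samples outside the grid contribute zero (assumed padding choice).
    """
    h, w = len(feature), len(feature[0])

    def at(y, x):
        # Zero padding for out-of-range reads.
        return feature[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy              # sampling position
            x0, y0 = math.floor(sx), math.floor(sy)
            ax, ay = sx - x0, sy - y0            # fractional parts
            top = (1 - ax) * at(y0, x0) + ax * at(y0, x0 + 1)
            bottom = (1 - ax) * at(y0 + 1, x0) + ax * at(y0 + 1, x0 + 1)
            out[y][x] = (1 - ay) * top + ay * bottom
    return out
```

In a fusion pipeline of this kind, the warped feature of one modality would then be combined with the other modality's BEV feature, so that responses belonging to the same object fall on the same BEV cells.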

Paper Structure (17 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Cross-modal conflicts hinder LiDAR-Camera 3D object detection. Green boxes represent correct predictions. Red dotted boxes and red solid boxes represent false negatives and false positives, respectively. (a) The single-modal LiDAR prediction is accurate, yet extrinsic conflicts caused by the uncertain depth of images lead to false positives in the fusion predictions. (b) The single-modal camera prediction is accurate, yet inherent conflicts caused by the sparse point structure of small objects lead to false negatives in the fusion predictions. Best viewed in color.
  • Figure 2: The LiDAR and camera BEV feature extraction process.
  • Figure 3: Semantic-guided flow-based alignment module.
  • Figure 4: Overview of our framework. Given point clouds and multi-view images as input: (I) Separate LiDAR and camera feature extraction branches produce modality-specific BEV features $\mathbf{H}_P$ and $\mathbf{H}_I$. (II) The multi-modal BEV fusion branch aligns $\mathbf{H}_P$ and $\mathbf{H}_I$ with the SFA module and integrates them into unified fusion BEV features $\mathbf{H}_F$. (III) The DQR mechanism generates fusion object queries $\mathbf{Q}_F$ and modality-specific object queries $\mathbf{Q}_P, \mathbf{Q}_I$. Finally, all queries are aggregated to predict 3D bounding boxes through a transformer decoder.
  • Figure 5: Analysis of the query sets from LiDAR, camera, and fusion heatmaps. GT denotes ground-truth objects. Note that our DQR aims at recovering queries to match with $\widetilde{GT}_P$ and $\widetilde{GT}_I$.
  • ...and 1 more figure
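
The captions for Figures 4 and 5 describe DQR as producing modality-specific queries from the LiDAR and camera heatmaps in addition to the fusion queries, so that objects whose response has dissolved in the fusion heatmap can still be detected. The sketch below is one plausible reading of that idea, not the authors' implementation: take the top-k peaks of the fusion heatmap, block the cells around them, and then take the top-k remaining peaks of each single-modality heatmap. The helper names, the square blocking neighborhood, and the `radius` parameter are all assumptions made for illustration.

```python
def top_k_cells(heatmap, k, blocked=frozenset()):
    """Return the k highest-scoring (row, col) cells that are not blocked."""
    cells = [
        (score, (r, c))
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
        if (r, c) not in blocked
    ]
    cells.sort(key=lambda item: item[0], reverse=True)
    return [pos for _, pos in cells[:k]]


def neighborhood(cells, radius, height, width):
    """Collect every grid cell within `radius` (square window) of any given cell."""
    covered = set()
    for r, c in cells:
        for rr in range(max(0, r - radius), min(height, r + radius + 1)):
            for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                covered.add((rr, cc))
    return covered


def recover_queries(fusion_hm, lidar_hm, camera_hm, k_fusion, k_recover, radius=1):
    """Pick fusion query positions, then recover extra single-modality positions.

    Positions near an existing fusion query are skipped (assumed rule), so the
    recovered queries target objects the fusion heatmap missed.
    """
    height, width = len(fusion_hm), len(fusion_hm[0])
    fusion_q = top_k_cells(fusion_hm, k_fusion)
    covered = neighborhood(fusion_q, radius, height, width)
    lidar_q = top_k_cells(lidar_hm, k_recover, covered)
    camera_q = top_k_cells(camera_hm, k_recover, covered)
    return fusion_q, lidar_q, camera_q
```

Under this reading, the three position lists would index into $\mathbf{H}_F$, $\mathbf{H}_P$, and $\mathbf{H}_I$ to form $\mathbf{Q}_F$, $\mathbf{Q}_P$, and $\mathbf{Q}_I$ before they are passed together to the decoder.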