Progressive Multi-Modal Fusion for Robust 3D Object Detection

Rohit Mohan, Daniele Cattaneo, Florian Drews, Abhinav Valada

TL;DR

This work tackles robust 3D object detection in autonomous driving by introducing ProFusion3D, a progressive fusion framework that fuses LiDAR and multi-view camera features in both BEV and PV, at both the intermediate-feature and object-query levels. It combines an inter-intra fusion module with dual BEV/PV decoders and a joint decoder to leverage local and global context, while a self-supervised multi-modal mask modeling pre-training scheme improves data efficiency and cross-modal representation learning. On nuScenes and Argoverse2, ProFusion3D achieves state-of-the-art performance and remains robust when one modality is unavailable. Pre-training also yields significant data-efficiency gains, and detailed ablations show the contributions of the fusion strategy, the pre-training objectives, and the decoder design.

Abstract

Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities either in Bird's Eye View (BEV) or Perspective View (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised mask modeling pre-training strategy to improve multi-modal representation learning and data efficiency through three novel objectives. Extensive experiments on nuScenes and Argoverse2 datasets conclusively demonstrate the efficacy of ProFusion3D. Moreover, ProFusion3D is robust to sensor failure, demonstrating strong performance when only one modality is available.

Paper Structure

This paper contains 14 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of multi-modal fusion strategies. (a) Raw input fusion: integrates sensor data directly, prone to noise. (b) Intermediate fusion: combines features in BEV or PV, may lose modality details. (c) Object Query level fusion: merges high-level semantic information, reliant on its quality. (d) Progressive fusion (proposed): incrementally integrates features from BEV and PV at intermediate and object query levels.
  • Figure 2: (a) Illustration of our proposed ProFusion3D architecture that employs progressive fusion. (b) The topology of our proposed fusion module and (c) The core component of the aforementioned fusion module.
  • Figure 2: Visualization of 3D object detection predictions of our proposed ProFusion3D on the validation set of nuScenes. Classes are color-coded as follows: car (RGB 255, 158, 0), barrier (RGB 220, 20, 60), truck (RGB 255, 127, 80), cone (RGB 112, 128, 144), bicycle (RGB 0, 0, 230), person (RGB 47, 79, 79).
  • Figure 3: Illustration of our multi-modal mask modeling pipeline for learning multi-modal latent representations. It patchifies/voxelizes the input modalities into tokens, masks them asymmetrically, adds noise to the unmasked tokens, and trains the model with objectives of reconstruction, denoising, and cross-modal attribute prediction.
  • Figure 3: Illustration of our in-house autonomous driving vehicle used to demonstrate the robustness of the ProFusion3D architecture on real-world scenes.
  • ...and 3 more figures
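
The input-preparation steps named in the mask modeling caption above (tokenize, mask asymmetrically, add noise to the unmasked tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "asymmetric" means a different mask ratio per modality, and the function name `mask_and_noise`, the ratios, the noise level, and the toy token shapes are all hypothetical values chosen for illustration.

```python
import random


def mask_and_noise(tokens, mask_ratio, noise_std, rng):
    """Pick a random subset of tokens to mask and add Gaussian noise to the rest.

    Returns the sorted masked indices and a dict mapping each visible
    (unmasked) index to its noised token.
    """
    n = len(tokens)
    n_masked = int(n * mask_ratio)
    masked_idx = set(rng.sample(range(n), n_masked))
    visible = {}
    for i, tok in enumerate(tokens):
        if i not in masked_idx:
            # Noise is applied only to tokens the encoder will actually see.
            visible[i] = [x + rng.gauss(0.0, noise_std) for x in tok]
    return sorted(masked_idx), visible


rng = random.Random(0)

# Toy stand-ins for camera patch tokens and LiDAR voxel tokens (hypothetical sizes).
camera_tokens = [[float(i)] * 4 for i in range(8)]
lidar_tokens = [[float(i)] * 4 for i in range(20)]

# Assumed asymmetric setting: the two modalities use different mask ratios.
cam_masked, cam_visible = mask_and_noise(camera_tokens, 0.75, 0.1, rng)
lidar_masked, lidar_visible = mask_and_noise(lidar_tokens, 0.5, 0.1, rng)
```

In a full pipeline, the visible tokens would be encoded, and the model would be trained to reconstruct the masked tokens, remove the injected noise, and predict cross-modal attributes, as the caption lists. Those training objectives are outside the scope of this sketch.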