DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation

Duy-Tho Le, Hengcan Shi, Jianfei Cai, Hamid Rezatofighi

TL;DR

DifFUSER is a novel approach that uses diffusion models for multi-modal fusion in 3D object detection and BEV map segmentation. It can refine or even synthesize sensor features when a sensor malfunctions, improving the quality of the fused output.

Abstract

Diffusion models have recently gained prominence as powerful deep generative models, demonstrating unmatched performance across various domains. However, their potential in multi-sensor fusion remains largely unexplored. In this work, we introduce DifFUSER, a novel approach that leverages diffusion models for multi-modal fusion in 3D object detection and BEV map segmentation. Benefiting from the inherent denoising property of diffusion, DifFUSER is able to refine or even synthesize sensor features in case of sensor malfunction, thereby improving the quality of the fused output. In terms of architecture, our DifFUSER blocks are chained together in a hierarchical BiFPN fashion, termed cMini-BiFPN, offering an alternative architecture for latent diffusion. We further introduce a Gated Self-conditioned Modulated (GSM) latent diffusion module together with a Progressive Sensor Dropout Training (PSDT) paradigm, designed to add stronger conditioning to the diffusion process and robustness to sensor failures. Our extensive evaluations on the nuScenes dataset show that DifFUSER not only achieves state-of-the-art performance, with 70.04% mIoU in BEV map segmentation, but also competes effectively with leading transformer-based fusion techniques in 3D object detection.
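To make the idea of sensor dropout during training more concrete, here is a minimal sketch. It is not the paper's PSDT implementation: the linear ramp, the 0.5 cap, the choice to zero out exactly one modality, and all function names (`dropout_probability`, `apply_sensor_dropout`, `build_condition`) are assumptions made purely for illustration. It only shows the general pattern of masking one sensor's features with a probability that grows over training, so that the fusion module sees progressively more simulated sensor failures.

```python
import random


def dropout_probability(step, total_steps, max_prob=0.5):
    """Hypothetical schedule: ramp the dropout probability linearly from 0 to max_prob."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return max_prob * progress


def apply_sensor_dropout(camera_feat, lidar_feat, prob, rng):
    """With probability `prob`, zero out exactly one modality, chosen uniformly.

    Returns the (possibly masked) features and the dropped sensor name, or None.
    """
    if rng.random() >= prob:
        return camera_feat, lidar_feat, None
    if rng.random() < 0.5:
        return [0.0] * len(camera_feat), lidar_feat, "camera"
    return camera_feat, [0.0] * len(lidar_feat), "lidar"


def build_condition(camera_feat, lidar_feat):
    """Concatenate per-modality features into one conditioning vector."""
    return list(camera_feat) + list(lidar_feat)


# Toy usage: flat lists stand in for BEV feature maps (an assumption for brevity).
rng = random.Random(0)
camera, lidar = [0.2, 0.7], [1.0, 0.4]
for step in (0, 500, 1000):
    prob = dropout_probability(step, total_steps=1000)
    cam, lid, dropped = apply_sensor_dropout(camera, lidar, prob, rng)
    print(step, prob, dropped, build_condition(cam, lid))
```

In a real pipeline the masked, concatenated features would serve as the condition for the denoising network; the flat lists above are placeholders for the actual BEV tensors.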

Paper Structure (15 sections, 14 equations, 8 figures, 10 tables, 2 algorithms)

Figures (8)

  • Figure 1: We propose using a denoising diffusion process to fuse multi-modal BEV features, which are then used for 3D object detection and BEV map segmentation.
  • Figure 2: Comparison of our DifFUSER fusion module with BEV encoder of the baseline. Our fusion module's output activation map is much more expressive than the baseline's, resulting in much better performance in downstream tasks.
  • Figure 3: Our DifFUSER framework first processes the input data (both point clouds and images) through their respective backbones to create initial latent features. These features are then concatenated and fed into the DifFUSER blocks. Within these blocks, the concatenated feature, partially masked out, is used as the condition to iteratively denoise the corrupted features, enhancing their quality at each step. The output feature is then used for downstream tasks.
  • Figure 3: Results for 3D object detection on nuScenes (val). Bold, blue, and underlined values indicate the best, second-best, and baseline performance, respectively. Mod. stands for Modality.
  • Figure 4: GSM diffusion block.
  • ...and 3 more figures