BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang

TL;DR

BEVDilation tackles multi-modal 3D detection with a LiDAR-centric BEV backbone guided by image features. It introduces two blocks: the Sparse Voxel Dilation Block (SVDB), which densifies sparse foreground voxels using image priors, and the Semantic-Guided BEV Dilation Block (SBDB), which steers LiDAR feature diffusion with image semantic guidance. Together they reduce the impact of depth-induced misalignment and noise. On nuScenes, it achieves state-of-the-art results with competitive efficiency, and ablations confirm the effectiveness of image-guided densification and deformable diffusion. The approach is more robust to depth noise than naive fusion, which matters for real-world autonomous driving perception.

Abstract

Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
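
To make the contrast between naive concatenation and image features used as guidance concrete, here is a minimal NumPy sketch. It is not the paper's SVDB or SBDB: the gating form, the function names, and the weight vector `w_gate` are assumptions chosen only to illustrate the idea that image features can modulate LiDAR BEV features without being written into the fused output themselves.

```python
import numpy as np

def concat_fusion(lidar_bev, image_bev):
    """Naive fusion: stack both modalities along the channel axis."""
    return np.concatenate([lidar_bev, image_bev], axis=0)

def guided_fusion(lidar_bev, image_bev, w_gate):
    """Hypothetical guidance: image features only produce a per-cell gate
    that rescales the LiDAR features (an illustrative assumption)."""
    # Project the image channels to one logit per BEV cell (a 1x1 projection).
    logits = np.einsum("c,chw->hw", w_gate, image_bev)
    gate = 1.0 / (1.0 + np.exp(-logits))  # sigmoid, values in (0, 1)
    # Residual modulation: LiDAR features are scaled by a factor in (1, 2).
    return lidar_bev * (1.0 + gate)[None]

rng = np.random.default_rng(0)
lidar_bev = rng.normal(size=(4, 8, 8))   # (channels, height, width)
image_bev = rng.normal(size=(6, 8, 8))
w_gate = rng.normal(size=6)              # hypothetical projection weights

fused_concat = concat_fusion(lidar_bev, image_bev)          # 10 channels
fused_guided = guided_fusion(lidar_bev, image_bev, w_gate)  # still 4 channels
```

In this toy version, a misplaced image feature can only rescale the LiDAR response in a cell; it cannot create a response where the LiDAR features are zero, which is one way to read the "LiDAR-centric" design goal.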

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: 3D object detection performance (NDS) vs speed (FPS) on nuScenes validation set.
  • Figure 2: Comparison of (a) indiscriminate fusion and (b) our LiDAR-centric strategy.
  • Figure 3: The overall architecture of our proposed BEVDilation. Given the point clouds and multi-view images, we take two individual backbones to extract multi-modal BEV features. For the LiDAR branch, we enhance the LiDAR BEV features with our proposed Sparse Voxel Dilation Block and Semantic-Guided BEV Dilation Block.
  • Figure 4: An illustration of SVDB. The newly padded and original voxels are merged with global receptive fields.
  • Figure 5: Visualization of sampling locations of SBDB at different stages. The black dot indicates the object center, and these red dots denote the sampling locations of this center query point in SBDB.
  • ...and 1 more figures
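
The densification idea behind SVDB (padding new voxels next to occupied ones where the image suggests foreground) can be sketched on a 2D occupancy grid. This is only an illustration under stated assumptions: the 3x3 neighbourhood, the threshold, and the names `dilate_foreground` and `fg_prob` are hypothetical, and the merging of padded and original voxels with global receptive fields described in Figure 4 is not modelled.

```python
import numpy as np

def dilate_foreground(occupied, fg_prob, threshold=0.5):
    """Add empty cells that touch an occupied cell and look like foreground.

    occupied: bool array (H, W), True where a LiDAR voxel exists.
    fg_prob:  float array (H, W), hypothetical image-derived foreground score.
    """
    h, w = occupied.shape
    padded = np.pad(occupied, 1, mode="constant", constant_values=False)
    # A cell is a "neighbor" if any cell in its 3x3 window is occupied.
    neighbor = np.zeros_like(occupied)
    for dy in range(3):
        for dx in range(3):
            neighbor |= padded[dy:dy + h, dx:dx + w]
    # Only pad cells that are currently empty and scored as foreground.
    new_cells = neighbor & ~occupied & (fg_prob > threshold)
    return occupied | new_cells, new_cells

occupied = np.zeros((5, 5), dtype=bool)
occupied[2, 2] = True                 # a single LiDAR voxel

fg_prob = np.zeros((5, 5))
fg_prob[2, 3] = 0.9                   # adjacent and foreground: padded
fg_prob[1, 2] = 0.2                   # adjacent but background: skipped
fg_prob[0, 0] = 0.9                   # foreground but not adjacent: skipped

dense, new_cells = dilate_foreground(occupied, fg_prob)
```

Restricting the padding to foreground cells keeps the grid sparse, whereas dilating every occupied cell unconditionally would grow the voxel count regardless of scene content.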