BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework

Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, Zhi Tang

TL;DR

BEVFusion tackles the fragility of LiDAR-camera fusion methods that rely on LiDAR inputs by decoupling the camera and LiDAR streams and projecting both into a shared BEV space before fusion. The framework employs a camera stream adapted from LSS with a Dual-Swin-Tiny backbone, a LiDAR stream using established voxelized backbones, and a Dynamic Fusion Module that adaptively fuses BEV features. It demonstrates strong generalization across LiDAR backbones and detection heads, achieves state-of-the-art performance on nuScenes, and shows substantial robustness under simulated LiDAR and camera malfunctions without post-processing. This approach enables reliable multi-modality perception in realistic autonomous driving scenarios and offers a versatile foundation for future multi-view temporal extensions and cross-modal alignment.
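To make the fusion step more concrete, here is a minimal sketch of an adaptive BEV fusion in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not spell out the module's internals, so the per-cell channel mixing, the sigmoid gate driven by global average pooling, and every name (`fuse_bev`, `mix_weights`, `gate_weights`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_bev(cam_bev, lidar_bev, mix_weights, gate_weights):
    """Fuse two BEV feature maps, each a list of channels (H x W nested lists).

    Assumed (hypothetical) design:
      1. concatenate the camera and LiDAR channels,
      2. mix channels per BEV cell with a fixed linear map,
      3. rescale each mixed channel by a sigmoid gate computed from
         its global average over the BEV grid.
    """
    stacked = cam_bev + lidar_bev  # channel-wise concatenation
    height, width = len(stacked[0]), len(stacked[0][0])

    # Step 2: every output channel is a weighted sum of all input channels.
    mixed = [
        [
            [sum(w * ch[r][c] for w, ch in zip(row_w, stacked)) for c in range(width)]
            for r in range(height)
        ]
        for row_w in mix_weights
    ]

    # Step 3: global average pooling, then one sigmoid gate per channel.
    pooled = [sum(sum(row) for row in ch) / (height * width) for ch in mixed]
    gates = [
        sigmoid(sum(w * p for w, p in zip(row_w, pooled))) for row_w in gate_weights
    ]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, mixed)]


# One camera channel and one all-zero LiDAR channel (mimicking missing LiDAR).
cam = [[[2.0, 4.0], [6.0, 8.0]]]
lidar = [[[0.0, 0.0], [0.0, 0.0]]]
fused = fuse_bev(cam, lidar, mix_weights=[[0.5, 0.5]], gate_weights=[[0.0]])
# fused == [[[0.5, 1.0], [1.5, 2.0]]]
```

Because both streams already live on the same BEV grid, the fused map stays non-zero even when the LiDAR channels are empty, which is the property the decoupled design relies on.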

Abstract

Fusing camera and LiDAR information has become a de-facto standard for 3D object detection tasks. Current methods rely on point clouds from the LiDAR sensor as queries to leverage features from the image space. However, this underlying assumption leaves current fusion frameworks unable to produce any prediction when the LiDAR malfunctions, whether the fault is minor or major. This fundamentally limits their deployment in realistic autonomous driving scenarios. In contrast, we propose a surprisingly simple yet novel fusion framework, dubbed BEVFusion, whose camera stream does not depend on LiDAR input, thus addressing the downside of previous methods. We empirically show that our framework surpasses the state-of-the-art methods under the normal training settings. Under the robustness training settings that simulate various LiDAR malfunctions, our framework significantly surpasses the state-of-the-art methods by 15.7% to 28.9% mAP. To the best of our knowledge, our framework is the first to handle realistic LiDAR malfunctions, and it can be deployed to realistic scenarios without any post-processing procedure. The code is available at https://github.com/ADLab-AutoDrive/BEVFusion.

Paper Structure (26 sections, 2 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Comparison of our framework with previous LiDAR-camera fusion methods. Previous fusion methods can be broadly categorized into (a) point-level fusion mechanisms [sindagi2019mvxnet, vora2020pointpainting, Wang2021PointAugmentingCA, Huang2020EPNetEP, Yin2021MVP] that project image features onto raw point clouds, and (b) feature-level fusion mechanisms [Chen2017Multiview3O, Yoo20203DCVFGJ, bai2022transfusion, li2022deepfusion] that project LiDAR features or proposals onto each view image separately to extract RGB information. (c) In contrast, we propose a novel yet surprisingly simple framework that disentangles the camera network from LiDAR inputs.
  • Figure 2: An overview of the BEVFusion framework. Given point clouds and multi-view images as input, two streams separately extract features and transform them into the same BEV space: i) the camera-view features are projected into the 3D ego-car coordinate frame to generate the camera BEV feature; ii) a 3D backbone extracts LiDAR BEV features from the point clouds. A fusion module then integrates the BEV features from the two modalities. Finally, a task-specific head built on the fused BEV feature predicts the target values of 3D objects. In the detection result figures, blue boxes are predicted bounding boxes, and red-circled ones are false-positive predictions.
  • Figure 3: Dynamic Fusion Module.
  • Figure 4: Visualization of predictions under the robustness settings. (a) We visualize the point clouds from the BEV perspective under two settings: limited field-of-view (FOV), and LiDAR failing to receive object reflection points, where the orange box marks an object whose points are dropped. Blue boxes are predicted bounding boxes and red-circled boxes are false-positive predictions. (b) We show the predictions of the state-of-the-art method, TransFusion, and ours under three settings. Current fusion approaches fail when the LiDAR input is missing, while our framework can leverage the camera stream to recover these objects.
  • Figure 5: Adaptive Module in FPN.
  • ...and 1 more figure
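
The two LiDAR failure modes named in the Figure 4 caption, limited FOV and missing object reflection points, can be mimicked with simple point filtering. The sketch below is an illustration under assumptions rather than the paper's protocol: the function names, the angular convention (azimuth measured from the +x axis), the axis-aligned BEV boxes, and the per-object drop probability are all hypothetical choices.

```python
import math
import random


def limit_fov(points, half_angle_deg):
    """Keep points whose azimuth lies within +/- half_angle_deg of the +x axis.

    `points` is a list of (x, y, z) tuples in an assumed ego frame.
    """
    kept = []
    for x, y, z in points:
        azimuth = math.degrees(math.atan2(y, x))
        if abs(azimuth) <= half_angle_deg:
            kept.append((x, y, z))
    return kept


def drop_object_points(points, boxes, drop_prob, rng):
    """Remove every point inside a box, for boxes selected with drop_prob.

    `boxes` holds assumed axis-aligned BEV boxes (xmin, ymin, xmax, ymax).
    `rng` is a random.Random instance so that runs are reproducible.
    """
    dropped = [box for box in boxes if rng.random() < drop_prob]

    def inside(point, box):
        x, y, _ = point
        xmin, ymin, xmax, ymax = box
        return xmin <= x <= xmax and ymin <= y <= ymax

    return [p for p in points if not any(inside(p, b) for b in dropped)]


points = [(10, 0, 0), (0, 10, 0), (-10, 0, 0), (10, 10, 0)]
front_only = limit_fov(points, half_angle_deg=60)
no_object = drop_object_points(points, [(5, -1, 15, 1)], 1.0, random.Random(0))
```

With these inputs, `front_only` keeps only the two points in the forward sector, and `no_object` loses the single point that falls inside the box.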