
GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection

Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, Li Wang

TL;DR

GraphBEV tackles the persistent problem of feature misalignment in LiDAR–camera BEV fusion for multi-modal 3D object detection. It introduces a LocalAlign module that enriches depth features with neighbor information via a graph-based approach and a GlobalAlign module that learns camera-LiDAR BEV offsets to remedy global misalignment; together they significantly improve robustness to projection errors. On nuScenes, GraphBEV sets a new state of the art, achieving 70.1% mAP and 72.9% NDS on the validation set and showing strong gains under misalignment noise (up to +8.3%). The approach also demonstrates improved BEV map segmentation and robust performance across weather, ego-distance, and object-size variants, highlighting its practical impact for real-world autonomous driving perception systems.

Abstract

Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to the inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called GraphBEV. Addressing errors caused by inaccurate point cloud projection, we introduce a LocalAlign module that employs neighbor-aware depth features via graph matching. Additionally, we propose a GlobalAlign module to rectify the misalignment between LiDAR and camera BEV features. Our GraphBEV framework achieves state-of-the-art performance, with an mAP of 70.1%, surpassing BEVFusion by 1.6% on the nuScenes validation set. Importantly, our GraphBEV outperforms BEVFusion by 8.3% under conditions with misalignment noise.

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: (a) Feature misalignment primarily arises from overlooking projection matrix errors between LiDAR and camera, which cause the LiDAR-to-camera projection to provide inaccurate depth for surrounding neighbors. (b) We propose GraphBEV, which enhances the LiDAR-to-camera projected depth with neighboring depths, using graph-based neighbor information to learn enriched contextual depth features. We then align the global multi-modal features by simulating offsets between LiDAR and camera BEV features and correcting them with learnable offsets. (c) Empirical results show that GraphBEV surpasses BEVFusion [bevfusion-mit] by 1.6% mAP on the nuScenes validation set [nuscenes] and by over 8.3% under noisy misalignment settings [zhujun_benchmarking].
  • Figure 2: Overview of the GraphBEV framework. The LiDAR branch largely follows the baselines [bevfusion-mit, Transfusion] to generate LiDAR BEV features. In the camera branch, we first extract camera BEV features using the proposed LocalAlign module, which addresses local misalignment due to sensor calibration errors. Subsequently, we simulate the offset noise between LiDAR and camera BEV features, then align the global multi-modal features through learnable offsets. Notably, offset noise is added to the GlobalAlign module only during training, to simulate global misalignment. Finally, we employ a dense detection head [Transfusion] to accomplish the 3D detection task.
  • Figure 3: Overview of the LocalAlign pipeline. The LocalAlign module addresses local misalignment in the LiDAR-to-camera projection by enhancing the camera-to-BEV transform with neighboring depth features, using a KD-Tree algorithm to find nearest-neighbor relations.
  • Figure 4: Overview of the GlobalAlign pipeline. The GlobalAlign module addresses misalignment in LiDAR-camera BEV feature fusion. During training, we add offset noise to simulate the global misalignment problem in the camera and LiDAR BEV features. Supervision is applied through a simple CBR module that learns the offsets of the camera BEV features. During testing, we do not introduce noise and use the learned offsets for forward inference.
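The KD-Tree neighbor lookup named in the Figure 3 caption can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names (`build_kdtree`, `knn`, `neighbor_depths`), the choice of pixel-space Euclidean distance, and the output layout (own depth followed by neighbor depths) are all assumptions made for illustration. It only shows the general idea of attaching the depths of the K nearest projected LiDAR points to each projected point.

```python
import heapq


def build_kdtree(points, depth=0):
    """Recursively build a 2-D KD-tree over (index, (u, v)) pairs (hypothetical helper)."""
    if not points:
        return None
    axis = depth % 2  # alternate between splitting on u and on v
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn(node, query, k, heap=None):
    """Collect the k nearest entries to `query` as a heap of (-squared_distance, index)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    idx, (u, v) = node["point"]
    dist2 = (u - query[0]) ** 2 + (v - query[1]) ** 2
    if len(heap) < k:
        heapq.heappush(heap, (-dist2, idx))
    elif dist2 < -heap[0][0]:
        # Closer than the current farthest candidate: swap it in.
        heapq.heapreplace(heap, (-dist2, idx))
    axis = node["axis"]
    diff = query[axis] - node["point"][1][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the k-th best so far.
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        knn(far, query, k, heap)
    return heap


def neighbor_depths(pixels, depths, k):
    """For each projected point, return [own depth, depths of up to k nearest neighbors].

    Assumption: neighbors are found in image (pixel) space among the projected points.
    """
    tree = build_kdtree(list(enumerate(pixels)))
    features = []
    for i, px in enumerate(pixels):
        # Query k + 1 entries because a point is normally its own nearest neighbor.
        heap = knn(tree, px, k + 1)
        ordered = sorted((-neg_d2, j) for neg_d2, j in heap)
        neighbors = [j for _, j in ordered if j != i][:k]
        features.append([depths[i]] + [depths[j] for j in neighbors])
    return features


# Four projected points forming two well-separated pairs in pixel space.
pixels = [(0, 0), (1, 0), (5, 5), (6, 5)]
depths = [10.0, 10.5, 30.0, 31.0]
print(neighbor_depths(pixels, depths, k=1))
```

In this toy input each point picks up the depth of the other member of its pair, so a depth that is wrong because of a small projection error is accompanied by nearby depths that a downstream network could use as context. How GraphBEV actually encodes and fuses these neighbor depths is described in the paper, not in this sketch.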