UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities

Shiming Wang, Holger Caesar, Liangliang Nan, Julian F. P. Kooij

TL;DR

This work addresses robustness to missing sensor modalities in multi-sensor 3D object detection by introducing UniBEV, which uses uniform BEV encoders and shared BEV queries to align camera and LiDAR features. It replaces modality-specific BEV pipelines with a uniform, deformable-attention-based approach and introduces Channel Normalized Weights (CNW), a learned channel-wise weighted average that fuses whichever modalities are available. On nuScenes, UniBEV achieves a higher average mAP over all input combinations ($52.5\%$) than BEVFusion ($43.5\%$) and MetaBEV ($48.7\%$), and ablations show that CNW fusion and shared queries improve robustness. The approach runs on LiDAR plus camera, LiDAR-only, or camera-only input without retraining, and the code is publicly available.

Abstract

Multi-sensor object detection is an active research topic in automated driving, but the robustness of such detection models against missing sensor input (missing modalities), e.g., due to a sudden sensor failure, is a critical problem that remains under-studied. In this work, we propose UniBEV, an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities: UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining. To help its detector head handle different input combinations, UniBEV aims to create well-aligned Bird's Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, all sensor modalities follow a uniform approach to resample features from their native sensor coordinate systems into BEV features. We furthermore investigate the robustness of various fusion strategies w.r.t. missing modalities: the commonly used feature concatenation, but also channel-wise averaging, and a generalization to weighted averaging termed Channel Normalized Weights. To validate its effectiveness, we compare UniBEV to state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations. In this setting, UniBEV achieves $52.5\%$ mAP on average over all input combinations, significantly improving over the baselines ($43.5\%$ mAP on average for BEVFusion, $48.7\%$ mAP on average for MetaBEV). An ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the BEV encoders of each modality. Our code is available at https://github.com/tudelft-iv/UniBEV.
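
The three fusion strategies named in the abstract can be written out as follows. This is a sketch in notation introduced here for illustration, not the paper's own: $F_m \in \mathbb{R}^{C \times H \times W}$ is the BEV feature map of modality $m$, $\mathcal{M} \subseteq \{\mathrm{cam}, \mathrm{lidar}\}$ is the set of modalities available at inference, and $w_m \in \mathbb{R}^{C}$ is a learned per-channel weight vector. The softmax normalization is an assumption; the abstract only states that CNW generalizes channel-wise averaging to a weighted average.

$$
\begin{aligned}
\text{Concatenation:} \quad & F = \left[ F_{\mathrm{cam}} ;\, F_{\mathrm{lidar}} \right] \in \mathbb{R}^{2C \times H \times W} \\
\text{Averaging:} \quad & F = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} F_m \\
\text{CNW:} \quad & F^{(c)} = \sum_{m \in \mathcal{M}} \tilde{w}_m^{(c)} F_m^{(c)}, \qquad \tilde{w}_m^{(c)} = \frac{\exp\left(w_m^{(c)}\right)}{\sum_{m' \in \mathcal{M}} \exp\left(w_{m'}^{(c)}\right)}
\end{aligned}
$$

Under this reading, setting all $w_m$ equal recovers plain averaging, and when only one modality is available its normalized weight is 1 in every channel, so the fused map keeps the same shape as in the two-sensor case. Concatenation, by contrast, always expects $2C$ input channels.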

Paper Structure (5 sections, 2 figures)

This paper contains 5 sections, 2 figures.

Figures (2)

  • Figure 1: Comparison of UniBEV with other relevant works. (a) BEVFusion [liang2022bevfusion] fuses multi-modal BEV features extracted from two separate branches by concatenation. (b) MetaBEV [ge2023metabev] fuses multi-modal BEV features extracted from two separate branches with a fusion module consisting of several deformable attention layers. (c) UniBEV extracts multi-modal BEV features from their original coordinate systems with uniform BEV encoders and fuses the BEV features with the CNW module. C and L in the figure denote the input from cameras and LiDAR, respectively.
  • Figure 2: The overall architecture of the UniBEV framework. (1) Multi-view images and point clouds are processed through their respective backbones to generate multi-modal features. (2) A predefined set of grid-shaped BEV queries, shared across modalities, is utilized. Guided by these shared BEV queries, modality-specific BEV encoders further refine the camera and LiDAR features independently to establish aligned BEV features. These encoders are constructed using deformable attention modules and accept unified queries, relevant reference points, and the backbone-extracted features as inputs. (3) The camera and LiDAR BEV features are fused along the channels according to the learned CNW values.
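
Step (3) of Figure 2 can be illustrated with a minimal pure-Python sketch of channel-wise weighted fusion. It is not the authors' implementation: the function name `cnw_fuse`, the `[C][N]` list layout (C channels, N flattened BEV cells), and the softmax over the available modalities are all assumptions made for illustration. The source only states that the BEV features are fused along the channels using learned CNW values.

```python
import math
from typing import Dict, List, Optional

# Hypothetical layout: feature[c][i] is channel c at flattened BEV cell i.
Feature = List[List[float]]


def cnw_fuse(
    features: Dict[str, Optional[Feature]],
    logits: Dict[str, List[float]],
) -> Feature:
    """Fuse per-modality BEV features with per-channel normalized weights.

    features: modality name -> BEV feature map, or None if the sensor is missing.
    logits:   modality name -> one learned scalar per channel (assumed form).
    """
    available = [name for name, feat in features.items() if feat is not None]
    if not available:
        raise ValueError("at least one modality must be available")

    reference = features[available[0]]
    fused: Feature = []
    for c in range(len(reference)):
        # Softmax over the *available* modalities only (an assumption),
        # so the weights of each channel always sum to 1.
        peak = max(logits[name][c] for name in available)
        exps = {name: math.exp(logits[name][c] - peak) for name in available}
        total = sum(exps.values())
        weights = {name: exps[name] / total for name in available}

        # Weighted sum of this channel across the available modalities.
        fused.append([
            sum(weights[name] * features[name][c][i] for name in available)
            for i in range(len(reference[c]))
        ])
    return fused
```

With equal logits this reduces to channel-wise averaging, and passing `None` for one modality returns the other modality's features unchanged, so the detector head sees a feature map of the same shape in every input combination.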