BEVMOSNet: Multimodal Fusion for BEV Moving Object Segmentation

Hiep Truong Cong, Ajay Kumar Sigatapu, Arindam Das, Yashwanth Sharma, Venkatesh Satagopan, Ganesh Sistu, Ciaran Eising

TL;DR

BEVMOSNet addresses moving object segmentation (MOS) in bird's-eye-view (BEV) with a fully end-to-end multimodal fusion framework that combines camera, LiDAR, and radar data. It employs multimodal deformable cross-attention (MDCA) for cross-sensor fusion in BEV, along with a correlation-based motion cue extractor and a dedicated MOS decoder to predict moving objects. On the nuScenes dataset, BEVMOSNet achieves state-of-the-art performance, reporting an IoU improvement of $36.59\%$ over the vision-only baseline BEV-MoSeg and $2.35\%$ over the multimodal SimpleBEV extension, with gains across varying distances and conditions. The work demonstrates the practical impact of multisensor fusion for reliable BEV perception, especially under adverse weather and low-light scenarios, while noting label limitations and outlining future extensions to other dynamic classes.

Abstract

Accurate motion understanding of the dynamic objects within the scene in bird's-eye-view (BEV) is critical to ensure a reliable obstacle avoidance system and smooth path planning for autonomous vehicles. However, this task has received relatively limited exploration compared to object detection and segmentation, with only a few recent vision-based approaches presenting preliminary findings that deteriorate significantly in low-light, nighttime, and adverse weather conditions such as rain. Conversely, LiDAR and radar sensors remain almost unaffected in these scenarios, and radar provides key velocity information about the objects. Therefore, we introduce BEVMOSNet, to our knowledge the first end-to-end multimodal fusion approach leveraging cameras, LiDAR, and radar to precisely predict the moving objects in BEV. In addition, we perform a deeper analysis to find the optimal strategy for deformable cross-attention-guided sensor fusion for cross-sensor knowledge sharing in BEV. Evaluating BEVMOSNet on the nuScenes dataset, we show an overall improvement in IoU score of 36.59% compared to the vision-based unimodal baseline BEV-MoSeg (Sigatapu et al., 2023), and 2.35% compared to the multimodal SimpleBEV (Harley et al., 2022), extended for the motion segmentation task, establishing this method as the state-of-the-art in BEV motion segmentation.
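
For readers unfamiliar with the metric, the IoU score quoted above can be illustrated with a minimal sketch for binary BEV masks. This is not the authors' evaluation code: the function name `binary_iou`, the toy grid, and the convention of returning 1.0 when both masks are empty are assumptions made purely for illustration.

```python
import numpy as np


def binary_iou(pred, target):
    """Intersection-over-union between two binary BEV masks.

    `pred` and `target` are arrays of the same shape in which a nonzero
    cell marks a moving object. Hypothetical helper for illustration.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(intersection) / float(union)


# Toy 2x2 BEV grid: one cell overlaps, three cells are in the union.
pred_mask = np.array([[1, 1], [0, 0]])
target_mask = np.array([[1, 0], [1, 0]])
iou = binary_iou(pred_mask, target_mask)
```

In this toy case the masks share one cell out of three occupied cells, so `iou` is one third.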

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We propose BEVMOSNet for motion understanding within the scene. We demonstrate that our multimodal fusion of six cameras (C), LiDAR (L), and radar (R) yields better IoU across all distance ranges when compared to camera-only and other multimodal models. * denotes the SimpleBEV baseline model extended for the motion segmentation task.
  • Figure 2: BEVMOSNet extracts features from camera, radar, and LiDAR inputs and transforms them into BEV, where they are fused by a sensor fusion module. Subsequently, a correlation block is applied to the fused BEV feature maps from the current and previous frames to extract motion cues, which are then combined with the current fused BEV feature map as input for the segmentation decoder.
  • Figure 3: Multimodal deformable cross-attention (MDCA) extracts complementary features from camera and radar sensors individually by separately applying attention weights $\mathbf{A}_{m}$ and learnable sampling offsets $\Delta\mathbf{P}_{m,h}$ in every attention head. $\oplus$ denotes concatenation.
  • Figure 4: Qualitative results on MOS in various weather conditions. The camera-only model predicts distant moving objects with lower confidence (blurred region, marked with red circles). It also fails to segment occluded moving objects, or when operating in low light conditions, such as at night (regions marked with black circles). Generally, LiDAR helps to locate object positions and estimate object orientation accurately; radar improves the segmentation of distant objects. By combining camera, LiDAR, and radar we can leverage the advantages of each modality to build a robust model, which reduces false positive predictions (marked with green circles).
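
The motion-cue step described in the Figure 2 caption can be sketched roughly as follows. The source only states that a correlation block is applied to the fused BEV feature maps of the current and previous frames and that the result is combined with the current map; the local cost-volume formulation, the channel-wise concatenation, the function names, and the `max_disp` parameter below are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def local_correlation(curr, prev, max_disp=1):
    """Correlate each BEV cell of `curr` with nearby cells of `prev`.

    curr, prev: fused BEV feature maps of shape (C, H, W).
    Returns an array of shape (D, H, W) with D = (2 * max_disp + 1) ** 2,
    one channel per candidate displacement (hypothetical formulation).
    """
    _, h, w = curr.shape
    # Zero-pad the previous map so every displacement stays in bounds.
    padded = np.pad(prev, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    channels = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, dy:dy + h, dx:dx + w]
            # Channel-averaged dot product between current and shifted features.
            channels.append((curr * shifted).mean(axis=0))
    return np.stack(channels, axis=0)


def decoder_input(curr, prev, max_disp=1):
    """Combine motion cues with the current fused map (assumed: concatenation)."""
    cues = local_correlation(curr, prev, max_disp)
    return np.concatenate([curr, cues], axis=0)


rng = np.random.default_rng(0)
bev_curr = rng.standard_normal((8, 16, 16))
bev_prev = rng.standard_normal((8, 16, 16))
cues = local_correlation(bev_curr, bev_prev, max_disp=1)
fused_with_cues = decoder_input(bev_curr, bev_prev, max_disp=1)
```

With `max_disp=1` the sketch yields nine displacement channels, so an 8-channel fused map becomes a 17-channel decoder input. A learned model would typically use a larger search window and learned layers around this step.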