MambaBEV: An efficient 3D detection model with Mamba2

Zihan You, Ni Wang, Hao Wang, Qichao Zhao, Jinxiang Wang

TL;DR

MambaBEV introduces a BEV-based 3D detection framework that leverages the Mamba2 structured state-space model for efficient long-range temporal fusion. The TemporalMamba block enables global BEV context integration by discrete BEV feature rearrangement and four-direction sequence processing, complemented by a Mamba-based DETR head for robust multi-object detection. On nuScenes, the base version achieves strong performance (NDS ≈ 51.7% and mAP ≈ 42.7%), with notable improvements in large-object detection and velocity estimation, and shows promise in end-to-end autonomous driving planning and forecasting tasks. Overall, the work demonstrates the viability of state-space models for autonomous driving perception, offering improved global context understanding and efficiency relative to traditional transformer-based temporal fusion methods.

Abstract

Accurate 3D object detection in autonomous driving relies on Bird's Eye View (BEV) perception and effective temporal fusion. However, existing fusion strategies based on convolutional layers or deformable self-attention struggle with global context modeling in BEV space, leading to lower accuracy for large objects. To address this, we introduce MambaBEV, a novel BEV-based 3D object detection model that leverages Mamba2, an advanced state-space model (SSM) optimized for long-sequence processing. Our key contribution is TemporalMamba, a temporal fusion module that enhances global awareness by introducing a BEV feature discrete rearrangement mechanism tailored for Mamba's sequential processing. Additionally, we propose a Mamba-based DETR as the detection head to improve multi-object representation. Evaluations on the nuScenes dataset demonstrate that MambaBEV base achieves an NDS of 51.7% and an mAP of 42.7%. Furthermore, an end-to-end autonomous driving paradigm validates its effectiveness in motion forecasting and planning. Our results highlight the potential of SSMs in autonomous driving perception, particularly in enhancing global context understanding and large-object detection.
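
For context on the two reported metrics: the nuScenes detection score (NDS) is the benchmark's composite of mAP and five mean true-positive error terms. The definition below is the standard nuScenes one, not something introduced by this paper, and the second line is only a rough implication of the rounded figures quoted in the abstract, not a number the paper reports.

```latex
\mathrm{NDS} = \frac{1}{10}\Bigl[\,5\,\mathrm{mAP}
  + \sum_{\mathrm{mTP}\,\in\,\mathbb{TP}} \bigl(1 - \min(1,\ \mathrm{mTP})\bigr)\Bigr],
\qquad
\mathbb{TP} = \{\mathrm{mATE},\ \mathrm{mASE},\ \mathrm{mAOE},\ \mathrm{mAVE},\ \mathrm{mAAE}\}

% Implied by NDS = 0.517 and mAP = 0.427 (rounded inputs, so approximate):
\sum_{\mathrm{mTP}\,\in\,\mathbb{TP}} \bigl(1 - \min(1,\ \mathrm{mTP})\bigr)
  \approx 10 \times 0.517 - 5 \times 0.427 = 3.035
```

Under that definition, the five true-positive scores average roughly 0.61, so NDS sits above mAP because the error terms (translation, scale, orientation, velocity, attribute) contribute more per term than the mAP component does.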

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Given RGB images captured by six surround-view cameras, a pretrained backbone generates six feature maps. These feature maps are processed through a Feature Pyramid Network (FPN) to extract multi-scale features. Subsequently, the Spatial Cross-Attention (SCA) module performs backward projection to produce a bird's-eye view (BEV) feature map. The TemporalMamba block then fuses historical BEV features with current BEV features, guiding the generation of new current BEV features. After several processing layers, a Mamba-based DETR head serves as the 3D object detection head.
  • Figure 2: Architecture of the proposed TemporalMamba module, which has five parts: Alignment, Compression, Re-arrange, Scan, and Re-merge. Query Re-arrange and Query Re-merge are shown in Figures 3 and 4, respectively.
  • Figure 3: Query Re-arrange: the BEV feature map is discretely serialized and then recombined in four directions: forward-left, forward-upward, reverse-left, and reverse-upward. The recombination scheme takes into account the effect of distance on feature interaction and adjusts the ordering in a balanced way.
  • Figure 4: Query Re-merge: In this approach, we introduce a process termed "Query Re-merge," which serves as the inverse of the "Query Re-arrange" operation. Given four enhanced sequences, we initially segment them at the positions determined by the re-arrange operation. Subsequently, these segments are reassembled following the original partitioning scheme. To restore the sequences to their original dimensions, we apply an average calculation along the third dimension, resulting in a tensor of shape (batch size, number of queries, 256).
  • Figure 5: Visualization of BEV features
  • ...and 1 more figures
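
The Query Re-arrange and Query Re-merge captions above can be illustrated with a minimal NumPy sketch. It is a simplified stand-in, not the paper's implementation: it assumes the four directions are plain row-major and column-major traversals plus their reverses, and it omits the paper's discrete partitioning and the Mamba2 scan itself. The function names `scan_orders`, `rearrange`, and `remerge` are hypothetical.

```python
import numpy as np


def scan_orders(h, w):
    """Four index permutations over an h x w BEV grid (assumed scan orders)."""
    idx = np.arange(h * w).reshape(h, w)
    row_major = idx.reshape(-1)    # left-to-right, row by row
    col_major = idx.T.reshape(-1)  # top-to-bottom, column by column
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def rearrange(bev, orders):
    """bev: (batch, h*w, channels) -> list of four reordered sequences."""
    return [bev[:, order, :] for order in orders]


def remerge(seqs, orders):
    """Undo each permutation, then average the four directions."""
    restored = []
    for seq, order in zip(seqs, orders):
        out = np.empty_like(seq)
        out[:, order, :] = seq  # scatter each token back to its grid slot
        restored.append(out)
    # Stack to (batch, queries, 4, channels) and average the third dimension.
    return np.mean(np.stack(restored, axis=2), axis=2)


if __name__ == "__main__":
    h, w, channels = 3, 4, 256
    bev = np.random.rand(2, h * w, channels)
    orders = scan_orders(h, w)
    seqs = rearrange(bev, orders)
    # A real model would run a sequence block over each entry of seqs here.
    merged = remerge(seqs, orders)
    print(merged.shape)
```

With no processing between the two steps, `remerge` returns the input unchanged, which is a quick check that the inverse mapping is correct. In the actual module, each of the four sequences would be enhanced by the scan before being merged back to shape (batch size, number of queries, 256).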