MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations

Hongyu Ke, Jack Morris, Kentaro Oguchi, Xiaofei Cao, Yongkang Liu, Haoxin Wang, Yi Ding

TL;DR

The paper tackles the high computational cost of building Bird's Eye View (BEV) representations from multi-camera inputs for 3D perception. It introduces MamBEV, a framework that uses State Space Models (SSMs) as a linear-complexity form of attention. Its Spatial Cross Mamba module, built on a Cross Quasi-separable State Space Model (XQSSM), fuses BEV queries with image features, and the framework also supports temporal fusion. Core contributions include reducing state size, a BEV position-aware merge, XQSSM, and extensive ablations showing competitive accuracy on nuScenes with lower memory use and fewer FLOPs than transformer-based approaches. The work targets edge-friendly BEV perception for autonomous driving, and the code is open source for reproducibility.

Abstract

3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird's Eye View (BEV) representations using linear spatio-temporal SSM-based attention. This approach supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Furthermore, we introduce SSM-based cross-attention, analogous to standard cross-attention, in which BEV query representations can interact with relevant image features. Extensive experiments demonstrate MamBEV's promising performance across diverse visual perception metrics, highlighting its advantages in input scaling efficiency compared to existing benchmark models.
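
To make the "linear" claim concrete, the following is a minimal sketch of a generic diagonal state space recurrence, not MamBEV's implementation. The function name `ssm_scan` and the parameters `a`, `b`, `c` are illustrative assumptions, and the parameters are fixed here, whereas Mamba-style models make them input-dependent. The point is only that each step updates a fixed-size state, so cost grows linearly with sequence length rather than quadratically as in full attention.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear SSM over a sequence (illustrative sketch).

    x: (T, D) input sequence.
    a, b, c: (D, N) per-channel state parameters (hypothetical, fixed).
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = sum_n(c * h_t).
    """
    T, D = x.shape
    h = np.zeros_like(a, dtype=float)   # fixed-size state, independent of T
    y = np.empty((T, D))
    for t in range(T):                  # one pass over the sequence: O(T)
        h = a * h + b * x[t][:, None]
        y[t] = (c * h).sum(axis=1)
    return y

# Small demo with arbitrary values.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal((16, 4))
a_demo = np.full((4, 8), 0.9)
y_demo = ssm_scan(x_demo, a_demo, np.ones((4, 8)), np.ones((4, 8)))
print(y_demo.shape)
```

Doubling the sequence length doubles the number of loop iterations while the state `h` keeps the same size, which is the scaling property the abstract refers to.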

Paper Structure

This paper contains 20 sections, 7 equations, 9 figures, 13 tables, 2 algorithms.

Figures (9)

  • Figure 1: We propose MamBEV, a novel paradigm that leverages both SSM-based Cross-Attention and Self-Attention mechanisms to generate BEV features from multi-camera inputs.
  • Figure 2: The overall pipeline of our architecture (MamBEV-Small). We present a novel method for incorporating SSMs into a BEV construction algorithm. Features are extracted from six egocentric multiview camera images over multiple frames. A ResNet backbone extracts camera features, which are passed to SSM-based encoder blocks. We found it necessary to use full attention during decoding; however, this has limited impact on computational complexity because the encoded feature sequence is relatively short.
  • Figure 3: Proposed Spatial Cross Mamba using XQSSM. Our novel method fuses two distinct spatial representations: 1) BEV queries, which are a top-down representation, and 2) image features, which come from an egocentric view.
  • Figure 4: Spatial Cross Mamba Pre- and Post-processing. Illustration of the processing performed on the input and output of the SSM to merge sampled information from multiple query copies in the input sequence into the BEV query grid. Image features (denoted by $v_{i,j}$) and their corresponding query vectors ($q_k$) are first interleaved to enable causal attention. Processed outputs of the SSM are normalized and fused into an updated query matrix $Q'_{BEV}$.
  • Figure 5: Visualization results of MamBEV-Small on the nuScenes val set. We show the predicted 3D bounding boxes in the multi-camera images and in the bird’s-eye view.
  • ...and 4 more figures
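
The pre- and post-processing described in the Figure 4 caption can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the paper's code: the function names `interleave` and `merge`, the rule that each query copy is placed directly after a chosen image-feature position, and the use of a plain average as the normalization are all hypothetical. The SSM itself is replaced by a causal running mean as a stand-in.

```python
import numpy as np

def interleave(feats, queries, insert_after, query_ids):
    """Build one sequence of image features with query copies mixed in.

    feats: (L, D) flattened image features.
    queries: (K, D) BEV queries.
    insert_after[m]: feature index after which copy m is placed (assumed rule).
    query_ids[m]: BEV query that copy m belongs to.
    Returns the sequence and an owner array (-1 marks image features).
    """
    order = np.argsort(insert_after, kind="stable")
    seq, owner, j = [], [], 0
    for i in range(len(feats)):
        seq.append(feats[i])
        owner.append(-1)
        while j < len(order) and insert_after[order[j]] == i:
            k = query_ids[order[j]]
            seq.append(queries[k])      # query copy sees only earlier features
            owner.append(k)
            j += 1
    return np.stack(seq), np.array(owner)

def merge(outputs, owner, num_queries):
    """Average the outputs of all copies of each query into one grid entry."""
    summed = np.zeros((num_queries, outputs.shape[1]))
    counts = np.zeros(num_queries)
    mask = owner >= 0
    np.add.at(summed, owner[mask], outputs[mask])
    np.add.at(counts, owner[mask], 1)
    return summed / np.maximum(counts, 1)[:, None]  # unsampled queries stay 0

# Demo: 4 image features, 2 BEV queries, 3 query copies.
feats = np.arange(8, dtype=float).reshape(4, 2)
queries = np.zeros((2, 2))
seq, owner = interleave(feats, queries, np.array([1, 3, 3]), np.array([0, 0, 1]))
# Stand-in for the SSM: causal running mean over the sequence.
outputs = np.cumsum(seq, axis=0) / np.arange(1, len(seq) + 1)[:, None]
q_updated = merge(outputs, owner, num_queries=2)
print(q_updated.shape)
```

Placing each query copy after the features it should read from is what lets a causal (left-to-right) model behave like cross-attention: the copy's output can depend only on features that precede it in the sequence.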