MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations
Hongyu Ke, Jack Morris, Kentaro Oguchi, Xiaofei Cao, Yongkang Liu, Haoxin Wang, Yi Ding
TL;DR
The paper tackles the high computational cost of building Bird's Eye View (BEV) representations from multi-camera inputs for 3D perception. It introduces MamBEV, a framework built on State Space Model (SSM)-based attention that scales linearly with input length. Its components include a Spatial Cross Mamba module and a Cross Quasi-Separable State Space Model (XQSSM), which fuse BEV queries with image features and support temporal fusion. Core contributions include a reduced state size, a BEV position-aware merge, XQSSM, and extensive ablations. Experiments on nuScenes show competitive accuracy with lower memory use and fewer FLOPs than transformer-based approaches. The work advances edge-friendly BEV perception for autonomous driving by enabling scalable BEV learning from multi-camera images, with open-source code for reproducibility.
Abstract
3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird's Eye View (BEV) representations using linear spatio-temporal SSM-based attention. This approach supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Furthermore, we introduce SSM-based cross-attention, analogous to standard cross-attention, in which BEV query representations interact with relevant image features. Extensive experiments demonstrate MamBEV's promising performance across diverse visual perception metrics and highlight its advantage in input scaling efficiency over existing benchmark models.
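To make the "linear" in SSM-based attention concrete, the following is a minimal sketch of a generic diagonal state-space recurrence, not MamBEV's Spatial Cross Mamba or XQSSM modules. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration. The point is that each input token updates a fixed-size state once, so the cost grows linearly with sequence length, whereas standard attention compares every pair of tokens.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state-space recurrence over a scalar sequence.

    Illustrative only (hypothetical names, not the paper's implementation):
        h_t = a * h_{t-1} + b * x_t    (elementwise over the state dimension)
        y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)  # fixed-size hidden state, independent of len(xs)
    outputs = []
    for x in xs:  # one pass over the sequence: O(len(xs) * len(a)) work
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# A single impulse decays geometrically through a one-dimensional state.
ys = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
print(ys)  # [1.0, 0.5, 0.25]
```

In this sketch the memory needed to process the sequence is the size of `state`, which is why reducing the state size, as listed among the contributions in the TL;DR, bears directly on memory efficiency.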
