OccMamba: Semantic Occupancy Prediction with State Space Models
Heng Li, Yuenan Hou, Xiaohan Xing, Yuexin Ma, Xiao Sun, Yanyong Zhang
TL;DR
The paper tackles semantic occupancy prediction on dense 3D voxel grids with heavy occlusion and complex scenes, a setting where CNNs struggle to capture global context and transformer-based approaches struggle with efficiency. It introduces OccMamba, the first Mamba-based network for this task. A height-prioritized 2D Hilbert expansion flattens the 3D voxels into a 1D sequence while preserving spatial structure, a hierarchical Mamba encoder captures global dependencies, and a local context processor recovers fine-grained detail. The method takes multi-modal inputs (LiDAR and images) and operates on uncompressed voxel features. It achieves state-of-the-art results on OpenOccupancy, SemanticKITTI, and SemanticPOSS, surpassing the previous state-of-the-art Co-Occ on OpenOccupancy by 5.1% IoU and 4.3% mIoU, with favorable memory and speed characteristics. The authors release their code as open source, which supports OccMamba’s practicality for real-world deployment and further work on large-scale semantic occupancy perception.
Abstract
Training deep learning models for semantic occupancy prediction is challenging due to factors such as the large number of occupancy cells, severe occlusion, limited visual cues, and complicated driving scenarios. Recent methods often adopt transformer-based architectures given their strong capability in learning input-conditioned weights and long-range relationships. However, transformer-based networks are notorious for their quadratic computational complexity, which seriously undermines their efficacy and deployment in semantic occupancy prediction. Inspired by the global modeling and linear computational complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. Specifically, we first design the hierarchical Mamba module and the local context processor to better aggregate global and local contextual information, respectively. In addition, to relieve the inherent domain gap between the linguistic and 3D domains, we present a simple yet effective 3D-to-1D reordering scheme, i.e., height-prioritized 2D Hilbert expansion. It maximally retains the spatial structure of 3D voxels and facilitates the processing of Mamba blocks. Endowed with these designs, OccMamba is capable of directly and efficiently processing large volumes of dense scene grids, achieving state-of-the-art performance across three prevalent occupancy prediction benchmarks: OpenOccupancy, SemanticKITTI, and SemanticPOSS. Notably, on OpenOccupancy, OccMamba outperforms the previous state-of-the-art Co-Occ by 5.1% IoU and 4.3% mIoU. Our implementation is open-sourced and available at: https://github.com/USTCLH/OccMamba.
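
The 3D-to-1D reordering mentioned above can be illustrated with a small sketch. It shows one plausible reading of a height-prioritized 2D Hilbert expansion: the XY plane is traversed along a 2D Hilbert curve, and at each XY cell all height (Z) levels are emitted before moving to the next cell. This is an assumption made for illustration, not the implementation in the OccMamba repository. The function names `hilbert_d2xy` and `height_prioritized_hilbert_order` are hypothetical, and the sketch assumes a square XY grid whose side length is a power of two.

```python
from typing import List, Tuple


def hilbert_d2xy(side: int, d: int) -> Tuple[int, int]:
    """Map a distance d along a 2D Hilbert curve to (x, y).

    Standard iterative conversion; `side` must be a power of two
    and 0 <= d < side * side.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the current sub-square so consecutive cells stay adjacent.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def height_prioritized_hilbert_order(side: int, height: int) -> List[Tuple[int, int, int]]:
    """Return voxel coordinates (x, y, z) in a 1D sequence order.

    Assumed ordering (illustrative): walk the XY plane along a Hilbert
    curve and, for each XY cell, emit every height level z first.
    """
    if side < 1 or side & (side - 1) != 0:
        raise ValueError("side must be a power of two")
    order = []
    for d in range(side * side):
        x, y = hilbert_d2xy(side, d)
        for z in range(height):  # height varies fastest
            order.append((x, y, z))
    return order
```

Calling `height_prioritized_hilbert_order(2, 2)` yields the two height levels of cell (0, 0), then those of (0, 1), (1, 1), and (1, 0). Consecutive XY cells in the sequence are always grid neighbors, which is the locality property that makes a Hilbert ordering attractive for a sequence model such as Mamba.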
