OccMamba: Semantic Occupancy Prediction with State Space Models

Heng Li, Yuenan Hou, Xiaohan Xing, Yuexin Ma, Xiao Sun, Yanyong Zhang

TL;DR

The paper tackles semantic occupancy prediction on dense 3D voxel grids in occluded, complex scenes, where CNN- and transformer-based approaches struggle with efficiency and global context. It introduces OccMamba, the first Mamba-based network for this task. A hierarchical Mamba module captures global dependencies, a local context processor recovers local detail, and a height-prioritized 2D Hilbert expansion reorders 3D voxels into a 1D sequence while preserving as much of the 3D spatial structure as possible. The method processes multi-modal inputs (LiDAR and images) with uncompressed voxel features and achieves state-of-the-art results on OpenOccupancy, SemanticKITTI, and SemanticPOSS. On OpenOccupancy it surpasses the previous state of the art, Co-Occ, by 5.1% IoU and 4.3% mIoU, while reporting favorable memory use and speed. The authors release their code as open source and position OccMamba as practical for real-world deployment in large-scale semantic occupancy perception.

Abstract

Training deep learning models for semantic occupancy prediction is challenging due to the large number of occupancy cells, severe occlusion, limited visual cues, and complicated driving scenarios. Recent methods often adopt transformer-based architectures given their strong capability in learning input-conditioned weights and long-range relationships. However, transformer-based networks are notorious for their quadratic computational complexity, which seriously undermines their efficacy and deployment in semantic occupancy prediction. Inspired by the global modeling and linear computational complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. Specifically, we design a hierarchical Mamba module and a local context processor to better aggregate global and local contextual information, respectively. In addition, to relieve the inherent domain gap between the linguistic and 3D domains, we present a simple yet effective 3D-to-1D reordering scheme, i.e., height-prioritized 2D Hilbert expansion, which retains as much of the spatial structure of 3D voxels as possible and facilitates processing by Mamba blocks. With these designs, OccMamba directly and efficiently processes large volumes of dense scene grids, achieving state-of-the-art performance on three prevalent occupancy prediction benchmarks: OpenOccupancy, SemanticKITTI, and SemanticPOSS. Notably, on OpenOccupancy, OccMamba outperforms the previous state-of-the-art Co-Occ by 5.1% IoU and 4.3% mIoU. Our implementation is open-sourced and available at: https://github.com/USTCLH/OccMamba.
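
The 3D-to-1D reordering named in the abstract can be illustrated with a small sketch. It rests on one reading of "height-prioritized 2D Hilbert expansion": the XY columns are visited in 2D Hilbert-curve order, and all voxels along the height (Z) axis of a column are emitted before moving to the next column. The function names, the power-of-two grid size, and the bottom-to-top Z traversal are assumptions made for illustration; this is not the authors' implementation.

```python
def hilbert_index_2d(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its Hilbert-curve index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def height_prioritized_order(n, depth):
    """Return (x, y, z) voxel coordinates in 1D sequence order.

    Assumption: XY columns follow the 2D Hilbert curve, and each column
    is traversed fully along Z before the next column starts.
    """
    columns = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda c: hilbert_index_2d(n, c[0], c[1]),
    )
    return [(x, y, z) for x, y in columns for z in range(depth)]


def flatten_voxels(grid, order):
    """Reorder a nested grid[x][y][z] of per-voxel features into a 1D list."""
    return [grid[x][y][z] for x, y, z in order]


# Hypothetical toy grid: 4 x 4 columns, 3 voxels high.
order = height_prioritized_order(n=4, depth=3)
print(order[:6])
```

Under this reading, voxels in the same column are always contiguous in the sequence, and consecutive columns are always neighbors in the XY plane, which matches the stated goal of favoring z-axis proximity while still preserving xy-axis proximity.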

Paper Structure (18 sections, 10 equations, 5 figures, 11 tables)

This paper contains 18 sections, 10 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: (a) Challenges in semantic occupancy prediction, (b) comparison in GPU memory consumption, and (c) performance comparison on OpenOccupancy validation set. Our OccMamba demonstrates high efficiency in handling a large number of voxel grids, outperforming all other semantic occupancy predictors, such as Co-Occ and M-CONet.
  • Figure 2: Schematic overview of our OccMamba. Given surround-view images and LiDAR point clouds, we first employ a 2D encoder and a 3D encoder to process them, obtaining camera features and LiDAR voxel features, respectively. A view transformer projects the camera features to camera voxel features. The camera and LiDAR voxel features are then fused and sent to the hierarchical Mamba module, where the proposed height-prioritized 2D Hilbert expansion reordering is used to make maximal use of the spatial cues of the voxels. In addition, a local context processor divides the Mamba features into multiple windows along the XY plane, further enhancing the local semantic information. Finally, the Mamba features are fed to the occupancy head, producing semantic occupancy predictions.
  • Figure 3: Comparison between different reordering schemes. (a) XYZ sequence, (b) ZXY sequence, (c) 3D Hilbert curve, and (d) our height-prioritized 2D Hilbert expansion. Small cubes of varying colors, ranging from red to purple, represent different voxels. The corresponding colored edges indicate the adjacency of voxels that become neighbors after reordering. In each sub-figure, its top half represents the isometric view, and the bottom half represents the top view. Our method maximizes z-axis proximity while also striving to preserve xy-axis proximity.
  • Figure 4: Visual comparison between OccMamba and M-CONet. From top to bottom: ground truth, predictions of M-CONet, and predictions of OccMamba. Our OccMamba makes more precise predictions than M-CONet, especially in the regions highlighted by red ellipses.
  • Figure 5: Reconstruction results of distant occupancy (about 40m).
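
The window division mentioned in the Figure 2 caption (the local context processor splits features into windows along the XY plane) can be sketched as a simple grouping step. The function name, the window size, and the choice to keep the full height inside each window are illustrative assumptions. What the processor computes inside each window is not described in this summary and is left out.

```python
from collections import defaultdict


def partition_xy_windows(coords, window):
    """Group voxel indices by the XY window they fall in.

    Assumption: windows tile only the XY plane, so every voxel of a
    column (all Z values) lands in the same window.
    """
    windows = defaultdict(list)
    for idx, (x, y, _z) in enumerate(coords):
        windows[(x // window, y // window)].append(idx)
    return dict(windows)


# Hypothetical toy grid: 4 x 4 x 2 voxels, split into 2 x 2 XY windows.
coords = [(x, y, z) for x in range(4) for y in range(4) for z in range(2)]
windows = partition_xy_windows(coords, window=2)
for key in sorted(windows):
    print(key, len(windows[key]))
```

Each entry maps a window coordinate to the indices of the voxels it contains, so a per-window operation can then be applied to each group independently.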