UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection

Xin Jin, Haisheng Su, Kai Liu, Cong Ma, Wei Wu, Fei Hui, Junchi Yan

TL;DR

LiDAR 3D object detection benefits from both local spatial detail and global context, but traditional backbones struggle with locality preservation and computational efficiency. The authors introduce UniMamba, a unified backbone that fuses 3D submanifold convolution with bidirectional State Space Models (SSMs) through three components: Spatial Locality Modeling (SLM), Complementary Z-order Serialization, and a Local-Global Sequential Aggregator (LGSA), arranged in an encoder-decoder architecture. Experiments on nuScenes, Waymo, and Argoverse 2 demonstrate state-of-the-art performance (e.g., 70.2 mAP on nuScenes test) with competitive compute (~61.9 GFLOPs), and ablations confirm the contributions of SLM, Z-order serialization, and LGSA. The approach offers an efficient way to obtain flexible receptive fields in LiDAR-based 3D object detection, supporting robust performance across object sizes and scenarios.

Abstract

Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing global dependencies in point clouds. These frameworks serialize the 3D voxels into a flattened 1D sequence for iterative self-attention. However, the spatial structure of the 3D voxels is inevitably destroyed during serialization. Moreover, because of the large number of 3D voxels and the quadratic complexity of Transformers, the sequences are grouped before being fed to the Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSMs) in 2D vision tasks, we propose a novel Unified Mamba (UniMamba), which integrates the merits of 3D convolution and SSMs in a concise multi-head manner, aiming to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, we design a UniMamba block that mainly consists of spatial locality modeling, complementary Z-order serialization, and a local-global sequential aggregator. The spatial locality modeling module uses 3D submanifold convolution to capture a dynamic spatial position embedding before serialization. The efficient Z-order curve is then adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts a channel grouping strategy to efficiently encode both "local and global" spatial inter-dependencies using multi-head SSMs. Additionally, an encoder-decoder architecture with stacked UniMamba blocks is formed to facilitate hierarchical multi-scale spatial learning. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo, and Argoverse 2. In particular, UniMamba achieves 70.2 mAP on the nuScenes dataset.
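
To make the Z-order serialization step concrete, here is a minimal Python sketch of sorting voxels along a Z-order (Morton) curve. It is not the authors' implementation. The function names `morton3d` and `serialize` are hypothetical, and the sketch assumes that the "horizontal" and "vertical" orders differ only in whether x or y occupies the lowest interleaved bit, which is one plausible reading of "complementary".

```python
from typing import List, Tuple

Voxel = Tuple[int, int, int]  # integer (x, y, z) voxel coordinates


def morton3d(a: int, b: int, c: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of a, b, c into one Z-order key.

    Bit i of `a` lands at position 3*i, of `b` at 3*i + 1, of `c` at 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((a >> i) & 1) << (3 * i)
        code |= ((b >> i) & 1) << (3 * i + 1)
        code |= ((c >> i) & 1) << (3 * i + 2)
    return code


def serialize(voxels: List[Voxel], order: str = "horizontal", bits: int = 10) -> List[int]:
    """Return voxel indices sorted along a Z-order curve.

    Assumption: "horizontal" interleaves as (x, y, z) and "vertical" as
    (y, x, z), so the two curves traverse the x-y plane in swapped order.
    """
    def key(i: int) -> int:
        x, y, z = voxels[i]
        if order == "horizontal":
            return morton3d(x, y, z, bits)
        if order == "vertical":
            return morton3d(y, x, z, bits)
        raise ValueError(f"unknown order: {order}")

    return sorted(range(len(voxels)), key=key)


if __name__ == "__main__":
    voxels = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
    print(serialize(voxels, "horizontal"))
    print(serialize(voxels, "vertical"))
```

Voxels that are adjacent in space tend to stay close in the resulting 1D sequence, and using two swapped orders gives each voxel a different set of sequence neighbours in each pass.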

Paper Structure

This paper contains 25 sections, 7 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Comparison of different 3D backbones. (a) Transformer-based backbone using local window grouping. (b) SSM-based backbone using global sequence grouping. (c) Our proposed UniMamba backbone using channel-wise local-global grouping.
  • Figure 1: Visual comparison of detection results between our UniMamba and SAFDNet (Zhang et al., 2024) on the nuScenes val set. Blue indicates predicted bounding boxes and red indicates ground-truth bounding boxes. The regions where UniMamba detects better are highlighted with red rectangles.
  • Figure 2: Illustration of our proposed UniMamba backbone. It consists of multiple stages, and each stage includes several UniMamba blocks that encode multi-scale features in an encoder-decoder architecture through down/up-sampling and stacking operations. The UniMamba block is our core component, which efficiently enables simultaneous extraction and aggregation of local and global contextual information. In UniMamba, we first voxelize the point clouds, then adopt the proposed UniMamba 3D backbone to extract rich multi-scale spatial contextual features. Finally, these enhanced features are fed into a BEV backbone and a detection head for final 3D object detection.
  • Figure 3: Illustration of the Local-Global Sequential Aggregator. The Local Sequential Encoder (LSE) applies a bidirectional SSM to each of the multiple 1D groups separately, using the proposed complementary Z-order serialization both vertically and horizontally. In contrast, the Global Sequential Encoder (GSE) handles a single group without sequence partition to capture global inter-dependencies. The Local-Global Sequential Aggregator (LGSA) then combines these two encoders in a multi-head format through channel grouping, which models local structural details and global context simultaneously.
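
The channel-grouped local/global idea described for the LGSA can be sketched in a few lines of Python. This is an illustrative toy, not the paper's method. A simple linear recurrence `h_t = decay * h_{t-1} + x_t` stands in for the SSM (it is not Mamba), the channels are assumed to split into two equal heads, and all names (`scan`, `bidirectional_scan`, `local_global_aggregate`, `group_size`, `decay`) are hypothetical.

```python
from typing import List

Seq = List[List[float]]  # N tokens, each with C channel values


def scan(seq: Seq, decay: float) -> Seq:
    """Toy stand-in for an SSM: h_t = decay * h_{t-1} + x_t, per channel."""
    if not seq:
        return []
    state = [0.0] * len(seq[0])
    out = []
    for token in seq:
        state = [decay * h + x for h, x in zip(state, token)]
        out.append(state)
    return out


def bidirectional_scan(seq: Seq, decay: float = 0.5) -> Seq:
    """Sum of a forward scan and a backward scan over the same sequence."""
    fwd = scan(seq, decay)
    bwd = scan(seq[::-1], decay)[::-1]
    return [[f + b for f, b in zip(ft, bt)] for ft, bt in zip(fwd, bwd)]


def local_global_aggregate(seq: Seq, group_size: int, decay: float = 0.5) -> Seq:
    """Split channels into a local head and a global head (assumed equal halves).

    Local head: the sequence is cut into groups of `group_size` tokens and
    each group is scanned independently. Global head: one scan over the
    whole sequence. The two outputs are concatenated along channels.
    """
    half = len(seq[0]) // 2
    local_in = [t[:half] for t in seq]
    global_in = [t[half:] for t in seq]

    local_out: Seq = []
    for start in range(0, len(seq), group_size):
        local_out.extend(bidirectional_scan(local_in[start:start + group_size], decay))
    global_out = bidirectional_scan(global_in, decay)

    return [l + g for l, g in zip(local_out, global_out)]


if __name__ == "__main__":
    # An impulse at token 0 in both channels; tokens 2 and 3 are in another group.
    tokens = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    for row in local_global_aggregate(tokens, group_size=2):
        print(row)
```

In this toy, the impulse at token 0 never reaches tokens 2 and 3 through the local channel, because they sit in a different group, while the global channel still carries a decayed response to them. That is the complementary behaviour the two heads are meant to provide.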