PointMamba: A Simple State Space Model for Point Cloud Analysis

Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, Xiang Bai

TL;DR

PointMamba introduces a simple, linear-time state space model for point cloud analysis by tokenizing 3D data with Hilbert-based space-filling curves and processing serialized tokens through a vanilla Mamba encoder. It combines dual serialization (Hilbert and Trans-Hilbert), a lightweight order indicator, and a serialization-based mask modeling pretraining regime to achieve strong performance with significantly reduced memory and FLOPs compared to Transformer baselines. Across ScanObjectNN, ModelNet40, ShapeNetPart, and few-shot settings, PointMamba demonstrates competitive or superior accuracy while maintaining low computational cost, underscoring the viability of SSMs as a 3D vision baseline. The work positions Mamba as a practical, scalable alternative for 3D perception and motivates further exploration of SSM-based foundation models for 3D tasks.

Abstract

Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity, making the design of a linear complexity method with global modeling appealing. In this paper, we propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks. Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs. Specifically, our method leverages space-filling curves for effective point tokenization and adopts an extremely simple, non-hierarchical Mamba encoder as the backbone. Comprehensive evaluations demonstrate that PointMamba achieves superior performance across multiple datasets while significantly reducing GPU memory usage and FLOPs. This work underscores the potential of SSMs in 3D vision-related tasks and presents a simple yet effective Mamba-based baseline for future research. The code will be made available at https://github.com/LMD0311/PointMamba.
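
To make the linear-complexity claim concrete, the following is a minimal sketch of a discretized state space recurrence over a token sequence. It is an illustration only, not the authors' implementation: the function name `ssm_scan`, the scalar state, and the fixed parameters `a`, `b`, `c`, and `delta` are assumptions chosen for readability. Mamba additionally makes its parameters depend on the input, which this sketch omits.

```python
import math


def ssm_scan(xs, a=-1.0, b=1.0, c=1.0, delta=0.1):
    """Run a scalar discretized SSM over a sequence in a single pass.

    Hypothetical illustration: a, b, c, delta are fixed here, whereas a
    selective SSM such as Mamba computes some of them from the input.
    """
    # Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t).
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b

    h = 0.0   # hidden state carried along the sequence
    ys = []
    for x in xs:              # one constant-cost update per token
        h = a_bar * h + b_bar * x
        ys.append(c * h)      # readout y_t = c * h_t
    return ys


if __name__ == "__main__":
    # Impulse input: the response decays geometrically by a_bar per step.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```

Each token triggers one constant-cost state update, so the total work grows linearly with the number of tokens, whereas self-attention compares every token with every other token.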

Paper Structure (24 sections, 11 equations, 17 figures, 3 tables)

This paper contains 24 sections, 11 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Comprehensive comparisons between our PointMamba and its Transformer-based counterparts [liu2022masked, chen2023pointgpt, dong2022autoencoders]. (a) Without bells and whistles, PointMamba achieves better performance than representative Transformer-based methods on various point cloud analysis datasets. (b)-(d) The Transformer has quadratic complexity, while PointMamba has linear complexity. For example, as the number of point tokens increases, PointMamba significantly reduces GPU memory usage and FLOPs and runs inference faster than the most convincing Transformer-based method, i.e., PointMAE [liu2022masked].
  • Figure 2: The pipeline of our PointMamba. It is simple and elegant, without bells and whistles. We first use Farthest Point Sampling (FPS) to select the key points. We then apply two types of space-filling curves, Hilbert and Trans-Hilbert, to generate the serialized key points. Based on these, KNN is used to form point patches, which are fed to the token embedding layer to generate the serialized point tokens. An order indicator is proposed to mark which space-filling curve each token was generated from. The encoder is extremely simple, consisting of $N \times$ plain, non-hierarchical Mamba blocks.
  • Figure 3: An intuitive illustration of global modeling from PointMamba.
  • Figure 4: The details of our proposed serialization-based mask modeling. During pre-training, we randomly choose one space-filling curve to generate the serialized point tokens for mask modeling, and tokens serialized by different curves receive different order indicators.
  • Figure 5: Classification on ModelNet40 [wu20153d]. Overall accuracy (%) is reported. The results are obtained from 1024 points without voting.
  • ...and 12 more figures
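
The tokenization steps described in the Figure 2 caption (FPS, serialization of the key points, KNN patches) can be sketched roughly as follows. This is a dependency-free illustration under stated assumptions, not the authors' code. In particular, a Z-order (Morton) curve is swapped in for the Hilbert curve because it is much shorter to write, and a second ordering obtained by permuting the axes stands in for the Trans-Hilbert variant. All function names and the `bits` and `axes` parameters are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, n_keys, start=0):
    """Greedily pick n_keys indices, each farthest from those already chosen."""
    chosen = [start]
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < n_keys:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(nxt)
        # Keep, for every point, its distance to the nearest chosen key.
        min_d = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d, points)]
    return chosen


def morton_code(p, bits=4, axes=(0, 1, 2)):
    """Z-order code of p (stand-in for a Hilbert index); p must lie in [0, 1)^3."""
    cells = 1 << bits
    q = [min(int(p[a] * cells), cells - 1) for a in axes]
    code = 0
    for bit in range(bits - 1, -1, -1):   # interleave bits, most significant first
        for v in q:
            code = (code << 1) | ((v >> bit) & 1)
    return code


def knn_patch(points, center, k):
    """Indices of the k points nearest to center."""
    return sorted(range(len(points)), key=lambda i: sq_dist(points[i], center))[:k]


def serialize_patches(points, n_keys, k, axes=(0, 1, 2)):
    """FPS key points, ordered along the curve, each expanded to a KNN patch."""
    keys = farthest_point_sampling(points, n_keys)
    keys.sort(key=lambda i: morton_code(points[i], axes=axes))
    return [knn_patch(points, points[i], k) for i in keys]


if __name__ == "__main__":
    grid = [(x / 4 + 0.1, y / 4 + 0.1, z / 4 + 0.1)
            for x in range(4) for y in range(4) for z in range(4)]
    first = serialize_patches(grid, n_keys=8, k=4)
    second = serialize_patches(grid, n_keys=8, k=4, axes=(2, 1, 0))
    print(first[0], second[0])
```

Both calls select the same key points and differ only in the order in which the patches are emitted, which is why a per-curve order indicator is needed once the two sequences are fed to the same encoder.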