ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding

Linshuang Diao, Sensen Song, Yurong Qian, Dayong Ren

TL;DR

ZigzagPointMamba tackles two bottlenecks in point cloud self-supervised learning: spatial discontinuities introduced by conventional token scanning, and weak local semantics under random masking. It introduces a 3D zigzag scan over the XY, XZ, and YZ planes to create spatially coherent token sequences, paired with a Semantic-Siamese Masking Strategy (SMS) that masks semantically redundant tokens, identified by token similarity. Pre-training with this combination improves downstream results: a 1.59% mIoU gain on ShapeNetPart (part segmentation), 0.4% higher accuracy on ModelNet40 (classification), and 0.19% to 1.22% higher accuracy on the ScanObjectNN subsets. By uniting spatial continuity with semantic-aware masking in a linear-time state-space framework, ZigzagPointMamba offers an efficient backbone for self-supervised 3D understanding.

Abstract

State Space models (SSMs) such as PointMamba enable efficient feature extraction for point cloud self-supervised learning with linear complexity, outperforming Transformers in computational efficiency. However, existing PointMamba-based methods depend on complex token ordering and random masking, which disrupt spatial continuity and local semantic correlations. We propose ZigzagPointMamba to tackle these challenges. The core of our approach is a simple zigzag scan path that globally sequences point cloud tokens, enhancing spatial continuity by preserving the proximity of spatially adjacent point tokens. Nevertheless, random masking undermines local semantic modeling in self-supervised learning. To address this, we introduce a Semantic-Siamese Masking Strategy (SMS), which masks semantically similar tokens to facilitate reconstruction by integrating local features of original and similar tokens. This overcomes the dependence on isolated local features and enables robust global semantic modeling. Our pre-trained ZigzagPointMamba weights significantly improve downstream tasks, achieving a 1.59% mIoU gain on ShapeNetPart for part segmentation, a 0.4% higher accuracy on ModelNet40 for classification, and 0.19%, 1.22%, and 0.72% higher accuracies respectively for the classification tasks on the OBJ-BG, OBJ-ONLY, and PB-T50-RS subsets of ScanObjectNN.
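
As a rough illustration of the zigzag scan idea described above, the sketch below orders token centers along a snake (boustrophedon) path so that consecutive tokens in the sequence stay spatially adjacent. It is not the authors' implementation: quantizing centers to a regular grid, the `grid` parameter, and the function names `zigzag_order` and `zigzag_orders_3d` are all assumptions made for illustration.

```python
import numpy as np

def zigzag_order(centers, plane=(0, 1), grid=8):
    """Return token indices ordered along a snake path (illustrative sketch).

    centers: (N, 3) array of token center coordinates.
    plane:   the two axes scanned within each slab, e.g. (0, 1) for XY.
    grid:    number of cells per axis (hypothetical quantization parameter).
    """
    centers = np.asarray(centers, dtype=float)
    lo = centers.min(axis=0)
    span = np.maximum(centers.max(axis=0) - lo, 1e-9)
    # Quantize each coordinate to an integer cell index in [0, grid - 1].
    cells = np.minimum(((centers - lo) / span * grid).astype(int), grid - 1)

    a, b = plane
    c = 3 - a - b  # the remaining axis indexes the slabs
    slab, row, col = cells[:, c], cells[:, a], cells[:, b]

    # Reverse the row direction on odd slabs, and the column direction on
    # every other row (counted across slabs), so the path never jumps back.
    row_s = np.where(slab % 2 == 1, grid - 1 - row, row)
    global_row = slab * grid + row_s
    col_s = np.where(global_row % 2 == 1, grid - 1 - col, col)

    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort((col_s, row_s, slab))

def zigzag_orders_3d(centers, grid=8):
    """One snake ordering per scanning plane: XY, XZ, and YZ."""
    return [zigzag_order(centers, plane=p, grid=grid)
            for p in ((0, 1), (0, 2), (1, 2))]

# Usage: serialize 64 random token centers along the three planes.
rng = np.random.default_rng(0)
token_centers = rng.random((64, 3))
xy_order, xz_order, yz_order = zigzag_orders_3d(token_centers, grid=4)
```

On a fully occupied grid, each step of this path moves to a neighboring cell along exactly one axis, which is the spatial-continuity property the zigzag scan is meant to preserve; the ordering of several tokens that fall into the same cell is left unspecified here.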

Paper Structure

This paper contains 20 sections, 9 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) Compared with ACT, PointMamba, and GeoMask3D, the proposed ZigzagPointMamba performs better on the ScanObjectNN dataset. (b) Reconstruction with SMS versus random masking, highlighting the advantage of SMS. (c) Features before and after fine-tuning, indicating that the method refines the feature representations.
  • Figure 2: ZigzagPointMamba pre-training pipeline. Key points are selected from the point cloud with farthest point sampling (FPS). Feature labels are extracted with the KNN algorithm and a lightweight PointNet. The tokens are serialized along the zigzag scan path. The serialized features are fed into a point cloud MAE architecture with SMS for training, yielding point cloud feature representations and providing parameters for downstream tasks.
  • Figure 3: Comparison of 2D and 3D zigzag. The 3D strategy scans on multiple planes. As an extension of the 2D one, it aids the model in preserving spatial proximity.
  • Figure 4: Details of Masking. Leverage SMS to mask out tokens with high semantic feature similarity in the point cloud. Then, apply random masking to a subset of the remaining tokens to enhance the robustness of the pre-training model.
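
The two-stage masking described in Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' SMS implementation: the redundancy score (each token's highest cosine similarity to any other token), the ratios `sem_ratio` and `rand_ratio`, and the function name `semantic_then_random_mask` are hypothetical choices made here.

```python
import numpy as np

def semantic_then_random_mask(feats, sem_ratio=0.3, rand_ratio=0.3, seed=0):
    """Return a boolean mask over tokens (True = masked). Illustrative only.

    Stage 1 masks the tokens that are most similar to some other token.
    Stage 2 randomly masks a fraction of the tokens that remain visible.
    """
    feats = np.asarray(feats, dtype=float)
    n = feats.shape[0]

    # Cosine similarity between all pairs of token features.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity

    # Assumed redundancy score: similarity to the closest other token.
    score = sim.max(axis=1)

    mask = np.zeros(n, dtype=bool)
    n_sem = int(round(sem_ratio * n))
    mask[np.argsort(-score, kind="stable")[:n_sem]] = True

    # Random masking over a subset of the tokens left unmasked by stage 1.
    remaining = np.flatnonzero(~mask)
    n_rand = int(round(rand_ratio * remaining.size))
    if n_rand > 0:
        rng = np.random.default_rng(seed)
        mask[rng.choice(remaining, size=n_rand, replace=False)] = True
    return mask

# Usage: tokens 0 and 1 are near-duplicates, the rest are mutually orthogonal.
token_feats = np.eye(6)
token_feats[1] = [1.0, 0.01, 0.0, 0.0, 0.0, 0.0]
token_mask = semantic_then_random_mask(token_feats, sem_ratio=1 / 3, rand_ratio=0.5)
```

In the usage example, the similarity stage selects the two near-duplicate tokens, and the random stage then hides half of the four remaining tokens, so the visible set shrinks without relying on random masking alone.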