MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models

Jiuming Liu; Jinru Han; Lihao Liu; Angelica I. Aviles-Rivero; Chaokang Jiang; Zhe Liu; Hesheng Wang

TL;DR

The paper addresses the challenge of efficient long-sequence understanding for 4D point cloud videos, where spatial irregularity and sequence length hinder traditional CNN/Transformer backbones. It proposes Mamba4D, a pure State Space Model-based backbone that decouples space and time, using an Intra-frame Spatial Mamba for short-term local dynamics and an Inter-frame Temporal Mamba for long-range temporal dependencies, guided by an anchor-frame 4D partition and point-tube neighborhoods with 4D positional encoding. The approach achieves competitive results on MSR-Action3D, HOI4D, and Synthia4D, with substantial efficiency gains such as 87.5% GPU memory reduction and 5.36x faster inference compared to transformer-based backbones, especially for long sequences. Extensive ablations validate design choices, scanning strategies, and positional encoding, demonstrating strong scalability and practical impact for real-time and long-horizon 4D perception tasks.

Abstract

Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. Moreover, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone based purely on State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4D achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 Score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. For long video sequences in particular, our method is significantly more efficient, with an 87.5% reduction in GPU memory and a 5.36 times speed-up. Code will be released at https://github.com/IRMVLab/Mamba4D.
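To make the "linear complexity" claim concrete, here is a minimal sketch of a discrete state space recurrence run as a single pass over a token sequence. It is not the paper's Mamba block: the function name and the parameters `a`, `b`, and `c` are hypothetical, the state is assumed diagonal, and the parameters are fixed, whereas Mamba makes them input-dependent. The only point is that each token is visited once, so cost grows linearly with sequence length rather than quadratically as in pairwise attention.

```python
from typing import List, Sequence


def diagonal_ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t) over a 1D sequence.

    a, b, c are per-state-channel parameters (hypothetical names, fixed values).
    Mamba's selective scan makes such parameters depend on the input; that is
    deliberately omitted here.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b, c must have the same length")

    h = [0.0] * len(a)  # hidden state, one value per state channel
    y: List[float] = []
    for x_t in x:  # single pass: work is proportional to len(x)
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


if __name__ == "__main__":
    # An impulse input decays geometrically through the state.
    print(diagonal_ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0]))
```

The hidden state has a fixed size regardless of how many tokens have been seen, which is why memory does not grow with the number of frames in this kind of recurrence.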

Paper Structure (12 sections, 3 equations, 7 figures, 9 tables)

Introduction
Related work
Methodology
Anchor Frame-based 4D Video Partition
Intra-frame Spatial Mamba
Inter-frame Temporal Mamba
Experiment
3D Action Recognition
4D Action Segmentation
4D Semantic Segmentation
Ablation Studies
Conclusion

Figures (7)

Figure 1: Comparison with previous 4D backbones. Recent SOTA works [fan2021point, fan2022point, wen2022point] mostly leverage the combination of convolution and transformer to capture the short-term and long-term dynamics, respectively. Our Mamba4D instead utilizes a unified spatio-temporal Mamba module for efficient 4D processing.
Figure 2: Efficiency comparison with recent SOTA 4D backbones. We substitute the CNN and transformer backbones in P4Transformer [fan2021point] with our proposed spatio-temporal Mamba models, which leads to 87.5% GPU memory reduction and 5.36$\times$ faster runtime. This reveals the great scalability potential of our method for processing long-sequence 4D videos.
Figure 3: The overview of our Mamba4D. To capture hierarchical 4D video dynamics, we design an Intra-frame Spatial Mamba on short-term video clips for local dynamic structures and an Inter-frame Temporal Mamba on the entire video sequence for global video understanding. Various spatio-temporal scanning strategies are proposed to better establish 4D correlation.
Figure 4: Different spatio-temporal ordering strategies in Inter-frame Temporal Mamba. We first spatially order all the input point frames according to the X, Y, and Z coordinates. Then, point tokens are scanned by temporal sequences or cross-temporal sequences.
Figure 5: Visualization of the action segmentation on HOI4D. Here, we display consecutive frame sequences when a person picks and places a mug on the HOI4D dataset.
...and 2 more figures
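The ordering step described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, each frame is sorted along a single chosen axis, and "cross-temporal" is read here as visiting the k-th ordered point of every frame before moving on to point k+1. The caption does not define that scan precisely, so treat this reading as a guess.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Token = Tuple[int, Point]  # (frame index, point coordinates)


def order_frame(points: List[Point], axis: int) -> List[Point]:
    """Sort one frame's points along one coordinate axis (0=X, 1=Y, 2=Z)."""
    return sorted(points, key=lambda p: p[axis])


def temporal_scan(frames: List[List[Point]], axis: int) -> List[Token]:
    """Frame by frame: all ordered points of frame 0, then frame 1, and so on."""
    return [(t, p) for t, frame in enumerate(frames) for p in order_frame(frame, axis)]


def cross_temporal_scan(frames: List[List[Point]], axis: int) -> List[Token]:
    """Assumed reading: emit the k-th ordered point of every frame, then k+1."""
    if not frames:
        return []
    ordered = [order_frame(frame, axis) for frame in frames]
    # Assumes every frame has the same number of points; extras are truncated.
    n = min(len(frame) for frame in ordered)
    return [(t, ordered[t][k]) for k in range(n) for t in range(len(ordered))]


if __name__ == "__main__":
    frames = [
        [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(4.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    ]
    print(temporal_scan(frames, axis=0))
    print(cross_temporal_scan(frames, axis=0))
```

Either scan produces a 1D token sequence, which is the input form a sequence model such as an SSM requires. The two differ only in whether spatially adjacent or temporally adjacent tokens end up next to each other.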
