MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models
Jiuming Liu, Jinru Han, Lihao Liu, Angelica I. Aviles-Rivero, Chaokang Jiang, Zhe Liu, Hesheng Wang
TL;DR
The paper addresses the challenge of efficient long-sequence understanding for 4D point cloud videos, where spatial irregularity and sequence length hinder traditional CNN/Transformer backbones. It proposes Mamba4D, a pure State Space Model-based backbone that decouples space and time, using an Intra-frame Spatial Mamba for short-term local dynamics and an Inter-frame Temporal Mamba for long-range temporal dependencies, guided by an anchor-frame 4D partition and point-tube neighborhoods with 4D positional encoding. The approach achieves competitive results on MSR-Action3D, HOI4D, and Synthia4D, with substantial efficiency gains such as 87.5% GPU memory reduction and 5.36x faster inference compared to transformer-based backbones, especially for long sequences. Extensive ablations validate design choices, scanning strategies, and positional encoding, demonstrating strong scalability and practical impact for real-time and long-horizon 4D perception tasks.
Abstract
Point cloud videos can faithfully capture real-world spatial geometries and temporal dynamics, which are essential for enabling intelligent agents to understand the dynamically changing world. However, designing an effective 4D backbone remains challenging, mainly due to the irregular and unordered distribution of points and temporal inconsistencies across frames. In addition, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone based purely on State Space Models (SSMs). Specifically, we first disentangle space and time in 4D video sequences and then establish the spatio-temporal correlation with our designed Mamba blocks. The Intra-frame Spatial Mamba module is developed to encode locally similar geometric structures within a certain temporal stride. Subsequently, locally correlated tokens are delivered to the Inter-frame Temporal Mamba module, which integrates long-term point features across the entire video with linear complexity. Our proposed Mamba4D achieves competitive performance on the MSR-Action3D action recognition (+10.4% accuracy), HOI4D action segmentation (+0.7 F1 Score), and Synthia4D semantic segmentation (+0.19 mIoU) datasets. For long video sequences in particular, our method achieves a significant efficiency improvement, with 87.5% GPU memory reduction and 5.36 times speed-up. Code will be released at https://github.com/IRMVLab/Mamba4D.
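To make the "disentangle space and time, then scan with linear complexity" idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a fixed scalar recurrence h_t = a * h_{t-1} + b * x_t in place of Mamba's input-dependent (selective) parameters, and it assumes each frame is summarized by the final state of its spatial scan. The names `ssm_scan`, `disentangled_scan`, `a`, and `b` are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def ssm_scan(tokens: List[Vector], a: float = 0.9, b: float = 0.1) -> List[Vector]:
    """Toy linear state space recurrence: h_t = a * h_{t-1} + b * x_t, y_t = h_t.

    Assumption: a and b are fixed scalars; real Mamba blocks make them
    input-dependent. The loop visits each token once, so the cost grows
    linearly with len(tokens).
    """
    if not tokens:
        return []
    state = [0.0] * len(tokens[0])
    outputs: List[Vector] = []
    for x in tokens:
        state = [a * h + b * xi for h, xi in zip(state, x)]
        outputs.append(list(state))
    return outputs


def disentangled_scan(video: List[List[Vector]]) -> List[Vector]:
    """video[t][n] is the feature vector of token n in frame t (frames non-empty).

    Stage 1 (intra-frame, spatial): scan the tokens of each frame and keep
    the final state as that frame's summary (an illustrative pooling choice).
    Stage 2 (inter-frame, temporal): scan the frame summaries over time.
    """
    frame_summaries = [ssm_scan(frame)[-1] for frame in video]
    return ssm_scan(frame_summaries)


if __name__ == "__main__":
    # Two frames, two tokens per frame, one feature channel.
    toy_video = [[[1.0], [1.0]], [[0.0], [0.0]]]
    for t, feature in enumerate(disentangled_scan(toy_video)):
        print(t, feature)
```

With T frames of N tokens each, this sketch performs T spatial scans of length N plus one temporal scan of length T, so the work grows linearly in the total token count, whereas full self-attention over all T * N tokens compares every pair of tokens.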
