3DInAction: Understanding Human Actions in 3D Point Clouds
Yizhak Ben-Shabat, Oren Shrout, Stephen Gould
TL;DR
The paper tackles action recognition in 3D point cloud sequences by introducing t-patches, temporally evolving local patches that capture action dynamics without requiring ground-truth point correspondences. A hierarchical t-patch network progressively aggregates spatial and temporal information to produce per-frame action predictions, with improved performance on DFAUST and IKEA ASM and competitive results on MSR-Action3D. The key contributions are the t-patch representation, a prior-free hierarchical architecture, and extensive ablations on patch design, robustness, and runtime. The approach is relevant to applications where point alignment is imperfect or missing, and it opens avenues for learning-based and multimodal extensions in 3D video analysis.
Abstract
We propose a novel method for 3D point cloud action recognition. Understanding human actions in RGB videos has been widely studied in recent years; however, its 3D point cloud counterpart remains under-explored. This is mostly due to the inherent limitations of the point cloud data modality (lack of structure, permutation invariance, and a varying number of points), which make it difficult to learn a spatio-temporal representation. To address these limitations, we propose the 3DInAction pipeline, which first estimates patches moving in time (t-patches) as a key building block, alongside a hierarchical architecture that learns an informative spatio-temporal representation. We show that our method achieves improved performance on existing datasets, including DFAUST and IKEA ASM. Code is publicly available at https://github.com/sitzikbs/3dincaction.
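To make the idea of a patch moving in time more concrete, the following is a minimal sketch of how one t-patch might be extracted from a sequence of unordered point clouds. It is an illustration under stated assumptions, not the released implementation: the tracking rule (move the anchor to its nearest neighbor in the next frame, then gather its k nearest neighbors) is assumed for this example, and the names `knn_indices` and `extract_t_patch` are hypothetical.

```python
import math
import random


def knn_indices(points, query, k):
    """Return the indices of the k points closest to `query` (brute force)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return order[:k]


def extract_t_patch(frames, anchor_index, k):
    """Follow one local patch through a sequence of point clouds.

    frames: list of frames, each a list of (x, y, z) tuples; frames may
            differ in size and point order.
    anchor_index: index of the starting point in frames[0].
    Returns a list with one k-point patch per frame.
    """
    anchor = frames[0][anchor_index]
    t_patch = []
    for t, points in enumerate(frames):
        if t > 0:
            # Assumed tracking rule: the new anchor is the point in the
            # current frame closest to the previous anchor, so no
            # ground-truth correspondence between frames is needed.
            anchor = points[knn_indices(points, anchor, 1)[0]]
        t_patch.append([points[i] for i in knn_indices(points, anchor, k)])
    return t_patch


# Toy sequence: a random cloud drifting along x, reshuffled in every frame
# so that point indices carry no correspondence information.
random.seed(0)
base = [(random.random(), random.random(), random.random()) for _ in range(64)]
frames = []
for t in range(4):
    moved = [(x + 0.01 * t, y, z) for x, y, z in base]
    random.shuffle(moved)
    frames.append(moved)

t_patch = extract_t_patch(frames, anchor_index=0, k=8)
print(len(t_patch), len(t_patch[0]))  # prints: 4 8
```

The brute-force neighbor search keeps the sketch dependency-free; for realistic cloud sizes a spatial index such as a k-d tree would be the usual substitute. A hierarchical network would then consume many such t-patches per sequence, which this sketch does not attempt to model.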
