3DInAction: Understanding Human Actions in 3D Point Clouds

Yizhak Ben-Shabat, Oren Shrout, Stephen Gould

TL;DR

The paper tackles 3D point cloud action recognition by introducing t-patches—temporally evolving local patches—that capture action dynamics without requiring ground-truth point correspondences. It presents a hierarchical t-patch network that progressively aggregates spatial and temporal information to yield per-frame action predictions, achieving notable improvements on DFAUST and IKEA ASM and competitive results on MSR-Action3D. Key contributions include the t-patch representation, a prior-free hierarchical architecture, and extensive ablations on patch design, robustness, and runtime. The work has practical impact for robust 3D action understanding in applications with imperfect or missing alignment data, and it opens avenues for learning-based, multimodal extensions in 3D video analysis.

Abstract

We propose a novel method for 3D point cloud action recognition. Understanding human actions in RGB videos has been widely studied in recent years; however, its 3D point cloud counterpart remains under-explored. This is mostly due to the inherent limitations of the point cloud data modality -- lack of structure, permutation invariance, and varying number of points -- which make it difficult to learn a spatio-temporal representation. To address these limitations, we propose the 3DinAction pipeline, which first estimates patches moving in time (t-patches) as a key building block, alongside a hierarchical architecture that learns an informative spatio-temporal representation. We show that our method achieves improved performance on existing datasets, including DFAUST and IKEA ASM. Code is publicly available at https://github.com/sitzikbs/3dincaction.

Paper Structure (16 sections, 2 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: 3DinAction pipeline. Given a sequence of point clouds, a set of t-patches is extracted. The t-patches are fed into a neural network to output an embedding vector. This is done hierarchically until finally the global t-patch vectors are pooled to get a per-frame point cloud embedding which is then fed into a classifier to output an action prediction per frame.
  • Figure 2: t-patch construction and collapse. Illustration of t-patch construction (left) and collapse (right). Starting from an origin point $x_q^{0}$ we find the nearest neighbours in the next frame iteratively to construct the t-patch subset (non-black points). A collapse happens when two different origin points, $x_q^{0}$ and $x_p^{0}$, have the same nearest neighbour at some time step, $\Psi_p^3=\Psi_q^3$ here.
  • Figure 3: 3DinAction GradCAM scores. The proposed 3DinAction pipeline learns meaningful representations for prominent regions. The presented actions are jumping jacks (top row), hips (middle row), and knees (bottom row). The columns represent progressing time steps from left to right. Colormap indicates high GradCAM scores in red and low scores in blue.
  • Figure 4: IKEA ASM example with t-patches. The flip table action for the TV Bench assembly is visualized, including the RGB image (top) and a grayscale 3D point cloud with t-patches (bottom). t-patches are highlighted in color. The blue t-patch is on the moving TV Bench assembly, the maroon one is on the moving person's arm, the teal one is on the static table surface, and the green one is on the colorful static carpet.
  • Figure 5: Bidirectional t-patch illustration. t-patches formed from start to finish are presented in light blue and the reverse-t-patches in pink. Note that the nearest neighbour in one direction is not necessarily the nearest neighbour in reverse (time step $t=4$).
  • ...and 7 more figures
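
The construction and collapse described in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`track_origin`, `t_patch`, `first_collapse`), the toy point coordinates, and the choice to form each frame's patch from the k nearest neighbours of the tracked point are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_index(point, cloud):
    """Index of the point in `cloud` closest to `point`."""
    return min(range(len(cloud)), key=lambda i: dist(point, cloud[i]))


def track_origin(frames, origin_index):
    """Follow one origin point through time by iterated nearest-neighbour search.

    Returns one tracked point index per frame, starting with `origin_index`.
    """
    indices = [origin_index]
    for t in range(1, len(frames)):
        previous_point = frames[t - 1][indices[-1]]
        indices.append(nearest_index(previous_point, frames[t]))
    return indices


def t_patch(frames, origin_index, k):
    """Assumed patch rule: the k nearest neighbours of the tracked point in each frame."""
    patches = []
    for cloud, idx in zip(frames, track_origin(frames, origin_index)):
        order = sorted(range(len(cloud)), key=lambda i: dist(cloud[idx], cloud[i]))
        patches.append(order[:k])
    return patches


def first_collapse(track_p, track_q):
    """First time step at which two tracks land on the same point, or None."""
    for t, (i, j) in enumerate(zip(track_p, track_q)):
        if i == j:
            return t
    return None


# Hypothetical toy sequence: three frames with two points each.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0)],
    [(0.5, 0.0, 0.0), (5.0, 0.0, 0.0)],
]
track_p = track_origin(frames, 0)
track_q = track_origin(frames, 1)
collapse_step = first_collapse(track_p, track_q)
```

In this toy sequence the two origins stay on separate points for the first two frames and then both select point 0 in the last frame, so `collapse_step` is 2. No correspondences between frames are supplied; only point positions are used, which is the property the caption highlights.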

Theorems & Definitions (1)

  • Definition 3.1