Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

Shanaka Ramesh Gunasekara, Wanqing Li, Philip Ogunbona, Jack Yang

TL;DR

This work addresses self-supervised skeleton-based action recognition by modeling the interaction between moving and static joints. It introduces Spatial-Temporal Joint Density (STJD), a learnable kernel-density measure, to identify discriminative prime joints and to guide learning through two frameworks: STJD-CL (contrastive) and STJD-MP (reconstruction-based). Evaluations on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD show gains over state-of-the-art contrastive methods of 3.5 and 3.6 percentage points on the NTU RGB+D 120 X-sub and X-set benchmarks, respectively, along with strong semi-supervised and transfer results. The approach captures joint interactions beyond predefined body parts, which is useful for robust, data-efficient skeleton-based action understanding.

Abstract

Traditional approaches in unsupervised or self-supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints, termed "prime joints", to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and non-prime joints. In addition, a method called STJD-MP is developed by integrating STJD with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improve performance, in particular by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.
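
This summary does not give the exact STJD definition, so the following is only an illustrative sketch of what a kernel-density measure over joints can look like. It computes, per frame, a Gaussian kernel density at each joint from the positions of the other joints, then ranks joints by how much that density varies over time. The function names, the `bandwidth` value, and the ranking rule are all assumptions made for illustration and are not the authors' formulation.

```python
import numpy as np

def joint_density(seq, bandwidth=0.5):
    """Per-frame Gaussian kernel density at each joint (illustrative only).

    seq: array of shape (T, J, 3) with T frames and J joints.
    Returns an array of shape (T, J).
    """
    # Pairwise squared distances between joints within each frame: (T, J, J)
    diff = seq[:, :, None, :] - seq[:, None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    # Drop each joint's own contribution (the diagonal is exactly 1)
    num_joints = seq.shape[1]
    return (kernel.sum(axis=-1) - 1.0) / (num_joints - 1)

def rank_joints_by_density_change(seq, k=5, bandwidth=0.5):
    """Hypothetical rule: return the k joints whose density varies most over time."""
    density = joint_density(seq, bandwidth)   # (T, J)
    variation = density.std(axis=0)           # (J,)
    return np.argsort(variation)[::-1][:k]
```

In a toy sequence where one joint moves toward a stationary joint, both joints show a changing density under this sketch, which is one way a static joint could end up selected alongside a moving one.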

Paper Structure

This paper contains 19 sections, 7 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of actionlets and prime joints for the action Drink from NTU RGB+D 60. An actionlet is selected from pre-defined parts on the basis of motion. In contrast, the prime joints are detected based on the evolution of the proposed STJD. The actionlet fails to select the discriminative static head joints, whereas the prime joints successfully include them.
  • Figure 2: The network architecture of the proposed STJD-CL. Prime joints are detected via STJD. A two-stream network is used for contrastive learning: the online stream is updated with gradients, while the offline stream is updated via momentum. The adaptive transformation $\mathcal{T}_1$ is adopted from the actionlet method. The InfoNCE loss $L_{CL}$ contrasts the representation of the entire skeleton $X_q$ with that of the prime joints in $X_k$, and $L_{RCL}$ minimizes the agreement between the prime and non-prime representations. The STJD module is used only in the pre-training stage. Once pre-trained, the encoder $f_q(\cdot)$ is used for the downstream tasks. Since no additional modules are introduced into $f_q(\cdot)$, the computational complexity at inference is unchanged compared to the baseline model.
  • Figure 3: Visualization of actionlet and prime joints. Green joints are the actionlet or prime joints of the respective sequence, and purple joints are the non-actionlet or non-prime joints. The 0 highlights the irrelevant joints included in the actionlet.
  • Figure 4: t-SNE visualization of embeddings on the NTU RGB+D 60 X-view benchmark. The same randomly selected samples from 15 classes are used for clarity. (The ActCLR actionlet results were regenerated with the provided code.)
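
The two generic ingredients named in the Figure 2 caption, an InfoNCE loss and a momentum-updated offline stream, can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the `temperature` and momentum `m` values, the function names, and the use of plain parameter dictionaries are assumptions. The $L_{RCL}$ term is omitted because its form is not given in this summary.

```python
import numpy as np

def info_nce(q, k_pos, k_neg, temperature=0.07):
    """InfoNCE loss for a batch of queries.

    q:     (N, D) query embeddings (e.g. whole-skeleton features)
    k_pos: (N, D) positive keys (e.g. prime-joint features)
    k_neg: (M, D) negative keys
    """
    # L2-normalise so that dot products are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    k_neg = k_neg / np.linalg.norm(k_neg, axis=1, keepdims=True)

    pos = np.sum(q * k_pos, axis=1, keepdims=True)   # (N, 1)
    neg = q @ k_neg.T                                # (N, M)
    logits = np.concatenate([pos, neg], axis=1) / temperature

    # Cross-entropy with the positive at index 0, via a stable log-sum-exp
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_z - logits[:, 0]))

def momentum_update(online, offline, m=0.999):
    """Exponential moving average of online parameters into the offline stream."""
    return {name: m * offline[name] + (1.0 - m) * online[name] for name in offline}
```

With `m` close to 1, the offline parameters change slowly, which keeps the keys consistent across training steps while only the online stream receives gradients.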