Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition

Xinpeng Yin; Wenming Cao

TL;DR

This work tackles self-supervised skeleton-based action recognition by introducing HA-CM, a Hierarchy and Attention guided Cross Masking framework that masks skeleton sequences from both spatial and temporal perspectives. By mapping Euclidean joint embeddings into hyperbolic space, HA-CM preserves the skeletal hierarchy. Its cross-masking scheme uses odd–even cross-grouping and Gumbel-Max sampling, and a cross-contrast loss encourages the model to learn instance-level features alongside intra-sample details. The approach achieves strong results on NTU-60, NTU-120, and PKU-MMD, with extensive ablations confirming the contributions of hyperbolic mapping, dual masking, and the cross-contrast objective. Overall, HA-CM advances self-supervised skeleton action recognition by combining hierarchical geometry with spatial and temporal cross masking to improve robustness and transferability.
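To make the role of Gumbel-Max sampling concrete, the following is a minimal sketch of how a masking criterion can be turned into a randomized choice of masked joints. It assumes the criterion yields one score per joint that is treated as an unnormalized log-probability, and it uses the top-k extension of the Gumbel-Max trick to pick several joints without replacement. The function name `gumbel_topk_mask`, the score values, and the number of masked joints are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def gumbel_topk_mask(scores, num_masked, rng=None):
    """Choose `num_masked` joint indices to mask (hypothetical helper).

    Each score is perturbed with i.i.d. Gumbel(0, 1) noise and the indices
    of the largest perturbed values are kept. With num_masked == 1 this is
    the classic Gumbel-Max trick: the chosen index follows softmax(scores).
    """
    rng = rng or random.Random()
    perturbed = []
    for idx, score in enumerate(scores):
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score + gumbel_noise, idx))
    perturbed.sort(reverse=True)
    return sorted(idx for _, idx in perturbed[:num_masked])


# Assumed example: six joints with made-up criterion scores, three masked.
scores = [2.0, 0.1, 1.5, 0.3, 0.2, 1.8]
masked = gumbel_topk_mask(scores, num_masked=3, rng=random.Random(0))
```

High-scoring joints are more likely to be masked, but the added noise means the mask changes between draws instead of always selecting the same joints.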

Abstract

In self-supervised skeleton-based action recognition, the mask reconstruction paradigm is attracting growing interest because effective masking improves model refinement and robustness. However, previous works relied primarily on a single masking criterion, causing the model to overfit to specific features and overlook other useful information. In this paper, we introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives. Specifically, in spatial graphs, we utilize hyperbolic space to maintain joint distinctions and preserve the hierarchical structure of high-dimensional skeletons, employing joint hierarchy as the masking criterion. In temporal flows, we replace traditional distance metrics with the global attention of joints as the masking criterion, addressing the convergence of distances in high-dimensional space and the lack of a global perspective. Additionally, we add a cross-contrast loss based on the cross-masking framework to the loss function to enhance the model's learning of instance-level features. HA-CM demonstrates efficiency and universality on three large-scale public datasets: NTU-60, NTU-120, and PKU-MMD. The source code of HA-CM is available at https://github.com/YinxPeng/HA-CM-main.
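As a rough illustration of mapping Euclidean embeddings into hyperbolic space, the sketch below uses the standard exponential map at the origin of the Poincaré ball with curvature $-c$. Both the choice of the Poincaré ball model and the use of distance to the origin as a hierarchy proxy are assumptions made for illustration; the abstract does not specify either. The function names `expmap0` and `dist_to_origin` are hypothetical.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector onto the Poincare ball of curvature -c.

    Standard exponential map at the origin:
    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def dist_to_origin(x, c=1.0):
    """Hyperbolic distance from the origin to a point x inside the ball."""
    norm = math.sqrt(sum(t * t for t in x))
    # Keep the argument strictly below 1 so atanh stays finite.
    arg = min(math.sqrt(c) * norm, 1.0 - 1e-12)
    return 2.0 / math.sqrt(c) * math.atanh(arg)


# Assumed example: a 2-D joint embedding with Euclidean norm 0.5.
joint_embedding = [0.3, 0.4]
point = expmap0(joint_embedding)
depth = dist_to_origin(point)
```

For $c = 1$ the distance from the origin to the mapped point equals twice the Euclidean norm of the input, so embeddings with larger norms land farther from the origin, where the hyperbolic metric separates them more strongly than the Euclidean one does.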

Paper Structure (21 sections, 25 equations, 6 figures, 8 tables)

Introduction
Related Work
Self-supervised Skeleton Representation Learning
Hyperbolic Feature Embedding
Preliminaries
Notations
Hyperbolic Learning
Method
Pipeline Overview
Prior Refinement
Positional Embedding
Hyperbolic Mapping
Cross Mask & Reconstruction
Cross Contrast Loss & Reconstruction Loss
Experiments
...and 6 more sections

Figures (6)

Figure 1: (Match components with colors) Architecture overview of HA-CM. The symbols $\textbf{E}_{e}$, $\textbf{E}$ and $\textbf{E}_{\mathbb{P}}$ correspond to those used in the main text. Note that the entire sequence is embedded in a high-dimensional hyperbolic space along the spatial dimension, while the computation of the masking criteria that determines which joints in different components should be masked is performed in a different space.
Figure 2: The details of CM&R. The green line corresponds to the spatial aspect, while the orange line represents strategy 1 of the temporal aspect. The right side of the figure illustrates the masking and reconstruction process after the joints are embedded in hyperbolic space, incorporating randomness.
Figure 3: Joint removal in the spatial pruning module using the NTU dataset. Blue joints are retained, red joints are removed, and $J^\prime$ denotes the number of joints after pruning.
Figure 4: Model Performance vs Temperature Coefficient ($\tau$). The error bars are obtained from three training runs per parameter value.
Figure 5: Confusion matrix on NTU-60 Xview. The baseline is MAMP.
...and 1 more figures
