Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition
Xinpeng Yin, Wenming Cao
TL;DR
This work tackles self-supervised skeleton-based action recognition by introducing HA-CM, a hierarchy and attention guided cross-masking framework that masks skeleton sequences from both spatial and temporal perspectives. By mapping Euclidean joint embeddings into hyperbolic space, HA-CM preserves the skeletal hierarchy and uses it as the spatial masking criterion. A cross-masking scheme with odd–even cross-grouping and Gumbel-Max sampling encourages the model to learn intra-sample details, while a cross-contrast loss adds instance-level features. The approach is evaluated on NTU-60, NTU-120, and PKU-MMD, with ablations examining the contributions of hyperbolic mapping, dual masking, and the cross-contrast objective. Overall, HA-CM combines hierarchical geometry with spatial and temporal cross masking to improve the robustness and transferability of the learned representations.
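The Gumbel-Max sampling mentioned above can be illustrated with a minimal sketch. It assumes each joint (or frame) already has a real-valued masking score that is treated as an unnormalized log-probability. The function name `gumbel_topk_mask`, the example scores, and the top-k extension (often called Gumbel-top-k, which samples without replacement) are illustrative assumptions, not taken from the HA-CM source code.

```python
import math
import random


def gumbel_topk_mask(scores, num_masked, rng=None):
    """Sample `num_masked` distinct indices, favoring higher scores.

    Assumption: `scores` are unnormalized log-probabilities. Adding
    independent Gumbel(0, 1) noise and keeping the largest perturbed
    values samples indices without replacement (Gumbel-top-k); with
    num_masked == 1 this reduces to plain Gumbel-Max sampling.
    """
    rng = rng or random.Random()
    perturbed = []
    for idx, score in enumerate(scores):
        # Clamp the uniform sample away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score + gumbel_noise, idx))
    perturbed.sort(reverse=True)
    return sorted(idx for _, idx in perturbed[:num_masked])


# Hypothetical scores for 6 joints; higher means "more likely to be masked".
example_scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.4]
masked_joints = gumbel_topk_mask(example_scores, num_masked=3, rng=random.Random(0))
```

Compared with always masking the top-scoring joints, the added noise keeps the masking stochastic while still biasing it toward high-scoring positions.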
Abstract
In self-supervised skeleton-based action recognition, the mask reconstruction paradigm is attracting growing interest because effective masking can improve model refinement and robustness. However, previous works primarily relied on a single masking criterion, causing the model to overfit to specific features and overlook other useful information. In this paper, we introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives. Specifically, in spatial graphs, we utilize hyperbolic space to maintain joint distinctions and effectively preserve the hierarchical structure of high-dimensional skeletons, employing joint hierarchy as the masking criterion. In temporal flows, we replace traditional distance metrics with the global attention of joints as the masking criterion, which addresses both the convergence of distances in high-dimensional space and the lack of a global perspective. Additionally, we incorporate a cross-contrast loss, built on the cross-masking framework, into the loss function to enhance the model's learning of instance-level features. HA-CM shows efficiency and universality on three public large-scale datasets: NTU-60, NTU-120, and PKU-MMD. The source code of HA-CM is available at https://github.com/YinxPeng/HA-CM-main.
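To make the hyperbolic mapping concrete, the following is a minimal sketch of projecting a Euclidean joint embedding into hyperbolic space. The abstract only says "hyperbolic space", so the Poincaré ball model with curvature -c, the exponential map at the origin, and the use of distance from the origin as a proxy for hierarchy level are all assumptions made here for illustration. The helper names `expmap0` and `dist_from_origin` are hypothetical.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Maps a Euclidean (tangent) vector `v` to a point strictly inside the
    ball of radius 1 / sqrt(c).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def dist_from_origin(x, c=1.0):
    """Hyperbolic distance between the origin and a point `x` in the ball."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(t * t for t in x))
    # Clamp so that atanh stays finite for points numerically on the boundary.
    arg = min(sqrt_c * norm, 1.0 - 1e-12)
    return 2.0 / sqrt_c * math.atanh(arg)


# Hypothetical 2-D embeddings for two joints.
joint_a = expmap0([0.3, 0.4])   # Euclidean norm 0.5
joint_b = expmap0([0.03, 0.04])  # Euclidean norm 0.05

# Assumption: distance from the origin serves as a rough hierarchy score.
hierarchy_a = dist_from_origin(joint_a)
hierarchy_b = dist_from_origin(joint_b)
```

In hyperbolic embeddings it is common for points near the origin to act as higher-level nodes and points near the boundary as leaves, which is why a distance-based score is used in this sketch; the criterion actually used by HA-CM may differ.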
