ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning

Aman Anand, Amir Eskandari, Elyas Rahsno, Farhana Zulkernine

TL;DR

ASMa tackles biased representation learning in skeleton-based self-supervised action recognition by introducing asymmetric spatio-temporal masking guided by joint degree and motion. The framework trains two ST-GCN encoders with complementary masks, integrates their diverse representations via a feature alignment module, and distills the combined knowledge into a lightweight student for edge deployment. Empirical results on NTU-60, NTU-120, and PKU-MMD show consistent gains over prior SSL methods (2.7–4.4% FT, up to 5.9% transfer) and competitive performance versus supervised baselines, with a distilled model achieving a 91.4% parameter reduction and 3x faster edge inference. The work demonstrates practical deployment potential and highlights that distillation from a linear-probed teacher can yield compact, generalizable skeleton representations.

Abstract

Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly mask high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies for learning the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment on resource-constrained devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods, with an average improvement of 2.7–4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves a 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
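
The two complementary masking strategies described in the abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the input layout `(T, V, C)` (frames, joints, coordinates), the motion score (mean joint displacement between consecutive frames), the deterministic top-k selection (the paper may sample joints and frames probabilistically), zero-filling as the masking operation, and the hypothetical function names.

```python
import numpy as np

def joint_degrees(num_joints, edges):
    """Degree of each joint in the skeleton graph (number of incident bones)."""
    deg = np.zeros(num_joints, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def frame_motion(x):
    """Per-frame motion score for x of shape (T, V, C), with T >= 2.

    Assumed definition: mean joint displacement from frame t-1 to frame t.
    Frame 0 has no predecessor, so it reuses the score of frame 1.
    """
    disp = np.linalg.norm(np.diff(x, axis=0), axis=-1)  # (T-1, V)
    motion = disp.mean(axis=1)                          # (T-1,)
    return np.concatenate([motion[:1], motion])         # (T,)

def top_k(scores, k, largest):
    """Indices of the k largest (or smallest) scores; ties broken by index."""
    order = np.argsort(scores, kind="stable")
    return order[len(order) - k:] if largest else order[:k]

def asymmetric_views(x, edges, n_joints, n_frames):
    """Build the two complementary masked views by zero-filling.

    view_a: high-degree joints and low-motion frames are masked.
    view_b: low-degree joints and high-motion frames are masked.
    """
    deg = joint_degrees(x.shape[1], edges)
    mot = frame_motion(x)

    view_a = x.copy()
    view_a[:, top_k(deg, n_joints, largest=True), :] = 0.0
    view_a[top_k(mot, n_frames, largest=False), :, :] = 0.0

    view_b = x.copy()
    view_b[:, top_k(deg, n_joints, largest=False), :] = 0.0
    view_b[top_k(mot, n_frames, largest=True), :, :] = 0.0
    return view_a, view_b
```

In this sketch each view would be fed to its own encoder, so that one encoder must reconstruct context around hub joints during static frames while the other handles extremity joints during fast motion.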

Paper Structure

This paper contains 33 sections, 14 equations, 7 figures, 12 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Skeleton graph with joints color-coded by degree centrality. (b) Bar plots show the average motion intensity of each joint for different actions across 50 frames. Through this analysis, we observe that low-degree joints and their adjacent joints exhibit high motion across actions.
  • Figure 2: Overview of the ASMa framework. (a) High-Degree Spatial Masking (HDSM) and Low-Degree Spatial Masking (LDSM) for joints, and High-Motion Temporal Masking (HMTM) and Low-Motion Temporal Masking (LMTM) for frames. (b) Inputs are split into 3 streams with random augmentation $\mathcal{T}$ and masked asymmetrically along the spatial and temporal streams. (c) Each encoder processes the triplet streams and is trained using the Barlow Twins loss. (d) Learned features from both encoders are fused via a feature alignment module for downstream evaluation. (e) ASMa-Distill: A lightweight student model learns from the frozen ASMa teacher's logits.
  • Figure 3: Ablation on different pairs of masking strategies. Refer to the appendix for other benchmarks.
  • Figure 4: Left: We fix the number of ST-GCN layers in $f_s$ at 5 and vary $\tau$ to test the temperature sensitivity of distillation. Right: We vary the number of ST-GCN layers in $f_s$ to test compression sensitivity.
  • Figure 5: t-SNE embedding projections of 9 randomly selected classes from NTU60-xsub (test set). Refer to the appendix for other classes.
  • ...and 2 more figures
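
The captions of Figure 2(e) and Figure 4 describe a student $f_s$ that learns from frozen teacher logits with a distillation temperature $\tau$. The exact loss is not given in this excerpt, so the following is only a sketch of the commonly used temperature-scaled KL divergence formulation; the $\tau^2$ scaling and the function names are assumptions, not confirmed details of ASMa-Distill.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    The tau**2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures; it is assumed here, not taken from the paper.
    """
    eps = 1e-12  # avoids log(0)
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return (tau ** 2) * kl.mean()
```

A larger $\tau$ flattens both distributions, so the student receives more signal from the teacher's non-maximal classes; that trade-off is what a temperature sweep such as the one in Figure 4 (left) would probe.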