Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond

Jiahang Zhang; Lilang Lin; Shuai Yang; Jiaying Liu

TL;DR

This work surveys self-supervised learning for skeleton-based action understanding, highlighting unique challenges posed by sparse spatial structure and temporal dynamics. It organizes existing methods into context-based, generative, and contrastive paradigms, and introduces PCM$^{3}$++, a versatile framework that jointly learns joint-, clip-, and sequence-level representations by integrating contrastive learning with masked skeleton modeling, aided by prompts and a post-distillation refinement. The authors provide a first multi-task benchmark across prominent skeleton datasets and backbones, demonstrating improved generalization to recognition, retrieval, detection, and few-shot tasks. They conclude with practical guidance and future directions, including long-term motion reasoning, multi-modal pre-training, and robustness in the wild, to advance versatile skeleton representation learning.

Abstract

Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has proven effective for skeleton-based action understanding. Unlike image data, skeleton data has sparser spatial structure and diverse representation forms, lacks background cues, and adds a temporal dimension, all of which pose new challenges for designing spatial-temporal motion pretext tasks. Many recent efforts in skeleton-based SSL have achieved remarkable progress, but a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey of self-supervised skeleton-based action representation learning. Following a taxonomy of context-based, generative learning, and contrastive learning approaches, we thoroughly review and benchmark existing works and shed light on possible future directions. Notably, our investigation shows that most SSL works rely on a single paradigm, learn representations at a single level, and are evaluated solely on the action recognition task, which leaves the generalization power of skeleton SSL models under-explored. To this end, we further propose a novel and effective SSL method for skeleton data that integrates versatile representation learning objectives of different granularity, substantially boosting generalization across multiple skeleton downstream tasks. Extensive experiments on three large-scale datasets demonstrate that our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.

Paper Structure (25 sections, 11 equations, 7 figures, 7 tables)

Introduction
Review on Skeleton-Based Action Representation SSL
Human Skeleton Representation
SSL Methods for Skeleton
Context-Based Methods
Generative Learning Methods
Contrastive-Learning Methods
Summary and Discussion
The Proposed Method
Motivation
Skeleton Contrastive Learning
Masked Skeleton Prediction
On the Connection of Contrastive Learning and Masked Prediction
The Whole Training Strategy
Discussion
...and 10 more sections

Figures (7)

Figure 1: The taxonomy framework for self-supervised skeleton-based representation learning in our survey. The survey is structured around three dimensions: skeleton data collection, SSL pretext design, and SSL downstream task evaluation, providing a comprehensive review.
Figure 2: The taxonomy of the skeleton-based self-supervised learning methods in our review.
Figure 3: Three types of context-based SSL methods for skeleton.
Figure 4: Different representations of skeleton data. From left to right are time series, 2D pseudo-image, spatial-temporal graph.
Figure 5: Summary of the advantages/limitations of different skeleton-based SSL methodologies.
...and 2 more figures
