STARS: Self-supervised Tuning for 3D Action Recognition in Skeleton Sequences

Soroush Mehraban; Mohammad Javad Rajabi; Andrea Iaboni; Babak Taati

TL;DR

STARS addresses the limited inter-action discriminability of masked-prediction pretraining for skeleton-based 3D action recognition. It first pretrains an MAE-style model with motion-aware masking, then applies a brief nearest-neighbor contrastive (NNCLR-based) tuning stage that partially updates the encoder. This two-stage design yields well-separated action clusters without hand-crafted augmentations, achieves state-of-the-art self-supervised results on NTU-60, NTU-120, and PKU-MMD, and significantly improves few-shot performance over masked-prediction models. The results indicate that selective encoder tuning can sharpen cluster structure while preserving generalization to actions unseen during pretraining.

Abstract

Self-supervised pretraining methods with masked prediction demonstrate remarkable within-dataset performance in skeleton-based action recognition. However, we show that, unlike contrastive learning approaches, they do not produce well-separated clusters. Additionally, these methods struggle with generalization in few-shot settings. To address these issues, we propose Self-supervised Tuning for 3D Action Recognition in Skeleton sequences (STARS). Specifically, STARS first applies a masked prediction stage with an encoder-decoder architecture. It then employs nearest-neighbor contrastive learning to partially tune the weights of the encoder, enhancing the formation of semantic clusters for different actions. By tuning the encoder for a few epochs, and without using hand-crafted data augmentations, STARS achieves state-of-the-art self-supervised results on various benchmarks, including NTU-60, NTU-120, and PKU-MMD. In addition, STARS exhibits significantly better results than masked prediction models in few-shot settings, where the model has not seen the actions during pretraining. Project page: https://soroushmehraban.github.io/stars/
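
The nearest-neighbor contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual NNCLR formulation: each projector output is replaced by its nearest neighbor in a queue of stored embeddings, and that neighbor must then match the predictor output of the same sequence against the other sequences in the batch. The function names, the temperature value, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_neighbor(z, queue):
    """Return the (normalized) queue entry most cosine-similar to z."""
    z = l2_normalize(z)
    return max((l2_normalize(q) for q in queue), key=lambda q: dot(z, q))

def nn_contrastive_loss(projections, predictions, queue, temperature=0.1):
    """InfoNCE loss with nearest-neighbor positives (NNCLR-style sketch).

    projections[i] and predictions[i] come from the same input sequence.
    The temperature default is an assumed value, not taken from the paper.
    """
    neighbors = [nearest_neighbor(z, queue) for z in projections]
    targets = [l2_normalize(p) for p in predictions]
    total = 0.0
    for i, nn in enumerate(neighbors):
        logits = [dot(nn, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(neighbors)

# Toy data: two stored embeddings and a batch of two sequences.
queue = [[1.0, 0.0], [0.0, 1.0]]
projections = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = nn_contrastive_loss(projections, [[1.0, 0.0], [0.0, 1.0]], queue)
loss_swapped = nn_contrastive_loss(projections, [[0.0, 1.0], [1.0, 0.0]], queue)
```

In this toy setup the loss is small when each prediction agrees with the neighbor of its own projection and large when the predictions are swapped, which is the pressure that pulls embeddings of the same action into a shared cluster.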

Paper Structure (20 sections, 7 equations, 7 figures, 18 tables, 1 algorithm)

Introduction
Related Works
Self-supervised Skeleton-Based Action Recognition
Combining Masked Autoencoders with Instance Discrimination
Method
Framework Overview
MAMP Pre-training (Stage 1)
Contrastive tuning (Stage 2)
Experiments
Datasets
Experimental Setup
Evaluation and Comparison
Ablation Study
Conclusion
3-Stage Design
...and 5 more sections

Figures (7)

Figure 1: Comparison between training time and test-time accuracy under the linear evaluation protocol. Training time is measured on a single NVIDIA GeForce RTX 3090 GPU.
Figure 2: The overall pipeline of our proposed STARS framework. The first stage uses MAMP (Mao et al., 2023) to reconstruct the motion of masked tokens. The second stage trains parameters of the projector and predictor using a contrastive learning approach in addition to partially tuning the encoder weights.
Figure 3: The t-SNE visualization of embedding features. We sample 15 action classes from the NTU-60 dataset, visualize the features extracted by our proposed STARS framework, and compare them with those of AimCLR (Guo et al., 2022), CMD (Mao et al., 2022), HiCo-Transformer (Dong et al., 2023), and MAMP (Mao et al., 2023).
Figure 4: Ablation study on (a) layer-wise learning rate decay and (b) queue size. The performance is evaluated on the NTU-60 XSub dataset under the KNN evaluation protocol (K=10).
Figure 5: The overall pipeline of our proposed STARS-3stage framework. The first stage uses MAMP (Mao et al., 2023) to reconstruct the motion of masked tokens. The second stage keeps the encoder parameters frozen and trains parameters of the projector and predictor using a contrastive learning approach. After these parameters have converged to well-separated clusters, the third stage involves partial tuning of the encoder parameters.
...and 2 more figures
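
Figure 4(a) ablates a layer-wise decay of the learning rate, which is one way to tune an encoder only partially: later layers receive close to the full learning rate while earlier layers are scaled down geometrically and therefore change very little. The sketch below shows the common form of this schedule. The exact formula and values used by STARS are not stated on this page, so the function name, the layer count, the base rate, and the decay factor are all illustrative assumptions.

```python
def layerwise_learning_rates(num_layers, base_lr, decay):
    """Return one learning rate per encoder layer, ordered input side first.

    The last layer trains at base_lr; every earlier layer is multiplied by
    one more factor of `decay`, so with decay < 1 the early layers barely move.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Hypothetical settings: a 4-layer encoder, base rate 1e-3, decay factor 0.5.
rates = layerwise_learning_rates(num_layers=4, base_lr=1e-3, decay=0.5)
for layer_index, lr in enumerate(rates):
    print(f"layer {layer_index}: lr = {lr:.6f}")
```

A decay of 1.0 reduces to ordinary full tuning with a single rate, while smaller values move the schedule toward updating only the last layers. In practice these per-layer rates would be passed to an optimizer as separate parameter groups.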
