Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence

Ruizhuo Xu, Linzhi Huang, Mei Wang, Jiani Hu, Weihong Deng

TL;DR

Skeleton2vec addresses a weakness of masked-prediction pre-training for skeleton-based action recognition: predicting low-level features such as raw joint coordinates or temporal motion is suboptimal. Instead, a transformer-based teacher encoder, updated as an exponential moving average (EMA), generates contextualized target representations. The framework pairs this teacher-student setup with an asymmetric decoder and a data2vec-inspired loss, and uses motion-aware tube masking to enforce long-range spatio-temporal understanding. On NTU-60, NTU-120, and PKU-MMD, it reports state-of-the-art performance under linear, fine-tuning, semi-supervised, and transfer learning protocols, supporting the value of contextualized targets and motion priors over prior MAE-like methods.

Abstract

Self-supervised pre-training paradigms have been extensively explored in the field of skeleton-based action recognition. In particular, methods based on masked prediction have pushed pre-training performance to new heights. However, these methods take low-level features, such as raw joint coordinates or temporal motion, as prediction targets for the masked regions, which is suboptimal. In this paper, we show that using high-level contextualized features as prediction targets can achieve superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework, which utilizes a transformer-based teacher encoder that takes unmasked training samples as input and creates latent contextualized representations as prediction targets. Benefiting from the self-attention mechanism, the latent representations generated by the teacher encoder can incorporate the global context of the entire training sample, leading to a richer training task. Additionally, considering the high temporal correlations in skeleton sequences, we propose a motion-aware tube masking strategy that divides the skeleton sequence into several tubes and performs persistent masking within each tube based on motion priors, thus forcing the model to build long-range spatio-temporal connections and focus on semantically richer regions of the action. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets demonstrate that our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
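
The motion-aware tube masking described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the `(T, V, C)` input layout, the softmax temperature, the mask ratio, and the Gumbel top-k sampling are all assumptions chosen for illustration. The sketch splits the sequence into temporal tubes, scores each joint by its motion inside a tube, and masks the same joints in every frame of that tube.

```python
import numpy as np

def motion_aware_tube_mask(x, tube_len=5, mask_ratio=0.9, temperature=0.75, rng=None):
    """Hypothetical sketch. x: (T, V, C) joint coordinates.
    Returns a boolean mask of shape (T, V); True marks a masked joint."""
    rng = np.random.default_rng() if rng is None else rng
    T, V, _ = x.shape
    assert T % tube_len == 0, "sequence length must be divisible by tube length"
    n_tubes = T // tube_len

    # Motion prior: absolute frame-to-frame displacement per joint (first frame gets 0).
    motion = np.abs(np.diff(x, axis=0, prepend=x[:1])).sum(axis=-1)   # (T, V)
    # Aggregate motion over the frames of each tube.
    intensity = motion.reshape(n_tubes, tube_len, V).sum(axis=1)      # (n_tubes, V)

    # Softmax over joints: joints that move more are more likely to be masked.
    logits = intensity / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Gumbel top-k: sample n_mask distinct joints per tube (assumed sampling scheme).
    n_mask = int(round(mask_ratio * V))
    scores = np.log(probs + 1e-12) + rng.gumbel(size=probs.shape)
    idx = np.argsort(-scores, axis=1)[:, :n_mask]
    tube_mask = np.zeros((n_tubes, V), dtype=bool)
    np.put_along_axis(tube_mask, idx, True, axis=1)

    # Persistent masking: repeat each tube's joint mask across its frames.
    return np.repeat(tube_mask, tube_len, axis=0)                     # (T, V)
```

Because the mask is constant within a tube, a masked joint cannot be recovered by copying it from an adjacent frame of the same tube, which is the information leakage the strategy is meant to prevent.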

Paper Structure

This paper contains 15 sections, 11 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: A comparative illustration of the prediction targets of MAE-like methods (a) and our Skeleton2vec (b). Skeleton2vec utilizes a teacher encoder $f(x)$ to generate globally contextualized representations as the prediction targets, instead of isolated joints or temporal motion with only local context.
  • Figure 2: The overall pipeline of the proposed Skeleton2vec framework. We adopt the motion-aware tube masking strategy (a) to guide the masking process, which prevents information leakage between adjacent frames and allows the model to focus more on semantically rich regions of motion. Subsequently, the teacher encoder $E_{\Delta}$ receives unmasked samples to construct latent contextualized targets, while the student encoder $E_{\theta}$ receives masked versions of the samples and predicts corresponding representations at the masked positions.
  • Figure 3: Ablation study on the EMA parameter $\tau_{0}$. The results are reported on the NTU-60 XSub dataset under the linear protocol.
  • Figure 4: Ablation study on the tube length. $\alpha=0$ is equivalent to random masking, while $\alpha=30$, which is the length of the input sequence, is equivalent to single-tube masking.
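
The teacher-student relationship in the Figure 2 caption and the EMA parameter in the Figure 3 caption can be made concrete with a small sketch. It assumes the teacher weights track the student weights through a standard exponential moving average with coefficient `tau`, and it uses a mean squared error over masked positions as a stand-in for the regression loss; the function names, the dictionary-of-arrays parameter format, and the choice of MSE are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ema_update(teacher, student, tau):
    """Hypothetical helper: teacher <- tau * teacher + (1 - tau) * student.
    Both arguments are dicts mapping parameter names to NumPy arrays."""
    for name, w in student.items():
        teacher[name] = tau * teacher[name] + (1.0 - tau) * w
    return teacher

def masked_regression_loss(pred, target, mask):
    """Mean squared error restricted to masked positions (assumed loss form).
    pred, target: (T, V, D) features; mask: (T, V) boolean, True = masked."""
    diff = (pred - target)[mask]          # (n_masked, D)
    return float((diff ** 2).mean())

# Toy usage: the teacher moves 10% of the way toward the student.
teacher = {"w": np.ones(3)}
student = {"w": np.zeros(3)}
ema_update(teacher, student, tau=0.9)     # teacher["w"] is now 0.9 everywhere
```

A value of `tau` close to 1 makes the teacher change slowly, so the targets stay stable while the student learns; a smaller value lets the targets follow the student more closely.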