Unsupervised Spatial-Temporal Feature Enrichment and Fidelity Preservation Network for Skeleton based Action Recognition

Chuankun Li, Shuai Li, Yanbo Gao, Ping Chen, Jian Li, Wanqing Li

TL;DR

This work tackles overfitting in unsupervised skeleton-based action recognition by revealing a misalignment between skeleton-level features and action-recognition manifolds. It introduces U-FEFP, a framework that jointly performs spatial-temporal feature enrichment via a lightweight ST-GCN+GConv-GRU backbone with BYOL-based contrastive learning and fidelity preservation through a reversed-prediction pretext task. The approach yields rich, discriminative representations while maintaining fidelity to the original skeleton information, and empirically outperforms prior unsupervised methods on NTU-60, NTU-120, and PKU-MMD, with supportive t-SNE visualizations. The results suggest that combining feature enrichment with sequence fidelity constraints can substantially elevate unsupervised skeleton action recognition and enable strong linear/semi-supervised performance. This advances practical unsupervised learning for skeleton-based action understanding by addressing core overfitting issues with a principled feature-design and training strategy.

Abstract

Unsupervised skeleton-based action recognition has achieved remarkable progress recently. However, existing unsupervised learning methods suffer from a severe overfitting problem, so small networks are used, which significantly reduces the representation capability. To address this problem, the overfitting mechanism behind unsupervised learning for skeleton-based action recognition is first investigated. It is observed that the skeleton is already a relatively high-level, low-dimensional feature, but it does not lie in the same manifold as the features for action recognition. Simply applying existing unsupervised learning methods tends to produce features that discriminate between individual samples rather than action classes, which results in overfitting. To solve this problem, this paper presents an Unsupervised spatial-temporal Feature Enrichment and Fidelity Preservation framework (U-FEFP) to generate rich distributed features that contain all the information of the skeleton sequence. A spatial-temporal feature transformation subnetwork is developed, using a spatial-temporal graph convolutional network and a graph convolutional gated recurrent unit network as the basic feature extraction network. Unsupervised Bootstrap Your Own Latent (BYOL) based learning is used to generate rich distributed features, and unsupervised pretext task based learning is used to preserve the information of the skeleton sequence. The two unsupervised learning schemes are combined in U-FEFP to produce robust and discriminative representations. Experimental results on three widely used benchmarks, namely NTU-RGB+D-60, NTU-RGB+D-120 and PKU-MMD, demonstrate that the proposed U-FEFP achieves the best performance compared with state-of-the-art unsupervised learning methods. t-SNE illustrations further validate that U-FEFP can learn more discriminative features for unsupervised skeleton-based action recognition.
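
The way the two unsupervised objectives could be combined can be illustrated with a minimal sketch. It assumes the standard BYOL regression loss (2 minus twice the cosine similarity between the online prediction and the target projection) and a mean-squared reconstruction error against the time-reversed input sequence. The function names, the reversed-order target, and the weight `lam` are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def byol_loss(online_pred, target_proj):
    """BYOL-style regression loss: 2 - 2 * cosine similarity.

    It is 0 when the two vectors point the same way and 4 when opposite.
    """
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def reversed_reconstruction_loss(decoded, sequence):
    """Mean squared error between the decoder output and the input frames
    taken in reversed temporal order (assumed form of the pretext task)."""
    target = sequence[::-1]
    total, count = 0.0, 0
    for frame_hat, frame in zip(decoded, target):
        for a, b in zip(frame_hat, frame):
            total += (a - b) ** 2
            count += 1
    return total / count


def joint_loss(online_pred, target_proj, decoded, sequence, lam=1.0):
    """Weighted sum of the enrichment and fidelity terms.

    `lam` is a hypothetical balancing weight, not a value from the paper.
    """
    enrichment = byol_loss(online_pred, target_proj)
    fidelity = reversed_reconstruction_loss(decoded, sequence)
    return enrichment + lam * fidelity
```

In this sketch each frame is a flat list of joint coordinates. The first term pulls the features of two augmented views together, while the second penalizes features from which the original sequence cannot be recovered.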

Paper Structure (21 sections, 5 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: t-SNE visualization of the features learned by different methods on the cross-subject benchmark of NTU-RGB+D-60. 60 samples are selected for each class. (a) Unsupervised learning based on a pretext task, P&C [Su2020PREDICTC]. (b) Unsupervised contrastive learning with a momentum LSTM, ASCAL [Rao2021AugmentedSB]. (c) Unsupervised contrastive learning with the adaptive GCN [Shi2019TwoStreamAG]. (d) Proposed U-FEFP. (e) Supervised learning with the adaptive GCN [Shi2019TwoStreamAG].
  • Figure 2: The framework of the proposed U-FEFP, with unsupervised BYOL based feature enrichment learning and unsupervised pretext task based fidelity preservation learning. It consists of an online network (in green), a target network (in blue) and a reversed prediction network (in beige). The online network is trained to learn rich representations, and the target network is slowly updated by the exponential moving average of the online network so that the two remain asynchronous. The reversed prediction network reconstructs the skeleton sequence from the features generated by the online network. The BYOL based contrastive learning (within the black dashed box) and the reversed prediction (pretext task) based learning (within the red dashed box) are used to maintain the similarity of different skeleton augmentations at the feature level and the instance level, respectively.
  • Figure 3: The structure of the GConv-GRU.
  • Figure 4: The structure of the decoder in the unsupervised pretext task based learning.
  • Figure 5: Comparisons of the proposed U-FEFP and 2s-AGCN in the process of unsupervised pre-training (a) and fine-tuning/test in the linear evaluation protocol (b), respectively.
  • ...and 2 more figures
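
The slow target-network update mentioned in the Figure 2 caption can be sketched as an exponential moving average over parameters. This is a minimal illustration that treats the parameters as flat lists of floats. The function name `ema_update` and the decay value `tau` are assumptions for illustration, since the decay rate is not stated in this section.

```python
def ema_update(target_params, online_params, tau=0.99):
    """Return new target parameters as an exponential moving average.

    Each target weight keeps a fraction `tau` of its old value and takes
    the remaining (1 - tau) from the matching online weight, so the target
    lags behind the online network. `tau` is a hypothetical decay rate.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


# Usage: the target drifts toward fixed online weights over repeated steps.
target = [0.0, 0.0]
online = [1.0, 1.0]
for _ in range(10):
    target = ema_update(target, online, tau=0.9)
# After n steps from zero, each target weight equals 1 - tau**n.
```

A decay closer to 1 makes the target change more slowly, which is what keeps the online and target networks asynchronous.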