ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL

Safwen Naimi; Wassim Bouachir; Guillaume-Alexandre Bilodeau

TL;DR

ReL-SAR addresses the scarcity of labeled data in skeleton-based action recognition by combining a lightweight convolutional transformer with BYOL-based self-supervised pre-training. A Selection-Permutation strategy restructures the skeleton joints before they reach a two-stage spatio-temporal encoder: a 1D-ConvNet computes spatial embeddings, and a transformer turns them into the final spatio-temporal features. The method achieves competitive results on four limited-size datasets (MCAD, IXMAS, JHMDB, and NW-UCLA) at significantly lower computational cost. The key contributions are BYOL-based skeleton representation learning, joint modeling of spatial and temporal cues, and the demonstrated efficiency gains, which make the model a candidate for deployment on resource-limited devices. Overall, the approach offers a practical route to unsupervised skeleton action recognition that remains competitive with state-of-the-art methods.

Abstract

Extracting robust and generalizable features for skeleton action recognition typically requires large amounts of well-curated data, which are costly to annotate and to process. Unsupervised representation learning is therefore of prime importance for leveraging unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we designed a lightweight convolutional transformer framework, named ReL-SAR, which exploits the complementarity of convolutional and attention layers to jointly model spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to obtain more informative descriptions from skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieved very competitive results on limited-size datasets (MCAD, IXMAS, JHMDB, and NW-UCLA), showing the effectiveness of our proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code, including all implementation parameters, is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL
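The BYOL objective mentioned in the abstract can be illustrated with a minimal sketch. It follows the general BYOL recipe: a normalized mean-squared error between the online network's prediction and the target network's projection, and an exponential-moving-average update of the target weights. It is not taken from the ReL-SAR code base: the function names, the flat parameter lists, and the decay value `tau=0.99` are assumptions chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def byol_loss(online_prediction, target_projection):
    """Normalized MSE between two embeddings, equal to 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def symmetric_byol_loss(pred_view1, proj_view2, pred_view2, proj_view1):
    """Sum of the loss in both directions, one term per augmented view."""
    return byol_loss(pred_view1, proj_view2) + byol_loss(pred_view2, proj_view1)


def ema_update(target_params, online_params, tau=0.99):
    """Move target weights toward online weights; tau is a placeholder value."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

In a full training loop, only the online network would receive gradients from this loss, while the target network would be refreshed with `ema_update` after each optimization step.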

Paper Structure (18 sections, 4 equations, 4 figures, 6 tables)

Introduction
Related Work
Proposed Method
Human Detection and Pose Estimation
Input skeleton pre-processing and Selection-Permutation strategy
Spatio-temporal feature extraction
Bootstrap Your Own Latent (BYOL) for representation learning
Action Classification
Experiments
Datasets and Evaluation Metric
Implementation and Training Details
BYOL pre-training
Fully Supervised baseline training
Evaluating the Effectiveness of Learned Feature Hierarchies
Semi-Supervised Evaluation
...and 3 more sections

Figures (4)

Figure 1: The human pose is a set of 2D connected keypoints corresponding to joints and important anatomic structures. (Left) Output of ViTPose model. (Right) Skeletons after our Selection-Permutation strategy.
Figure 2: Illustration of our convolutional transformer model for spatio-temporal feature extraction. An input skeleton sequence is first fed to a 1D-ConvNet. The resulting spatial embeddings are then fed to a transformer that generates the final spatio-temporal feature embeddings used for classification.
Figure 3: Proposed skeleton-based BYOL for action recognition.
Figure 4: We fix the length $T$ of the input sequence and test the influence of our Selection-Permutation strategy. The plot shows action recognition accuracy under full fine-tuning.
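The Selection-Permutation step shown in Figure 1 can be pictured as an index gather over the keypoints of each frame: keep a subset of joints and emit them in a chosen order. The sketch below illustrates only that general idea. The helper names and the example `joint_order` are hypothetical and are not the joints or ordering used in the paper; the example indices assume the 17-keypoint COCO layout.

```python
def select_and_permute(frame, joint_order):
    """Keep only the keypoints listed in joint_order, in that order.

    frame: list of (x, y) keypoints for one skeleton.
    joint_order: indices of the joints to keep (hypothetical choice).
    """
    return [frame[index] for index in joint_order]


def preprocess_sequence(sequence, joint_order):
    """Apply the same selection and ordering to every frame of a sequence."""
    return [select_and_permute(frame, joint_order) for frame in sequence]


# Hypothetical ordering: nose, then the left arm, then the right arm.
EXAMPLE_JOINT_ORDER = [0, 5, 7, 9, 6, 8, 10]

# Toy sequence of T = 2 frames with 17 keypoints each.
toy_sequence = [[(float(j), float(j + t)) for j in range(17)] for t in range(2)]
restructured = preprocess_sequence(toy_sequence, EXAMPLE_JOINT_ORDER)
```

Because the same `joint_order` is applied to every frame, the sequence length $T$ is unchanged and only the per-frame joint layout differs.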