ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL
Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau
TL;DR
ReL-SAR addresses the scarcity of labeled data in skeleton-based action recognition by combining a lightweight convolutional transformer with BYOL-based self-supervised pre-training. The method uses a Selection-Permutation strategy to restructure the skeleton joints and a convolutional transformer encoder that jointly models spatial and temporal cues, achieving competitive results on four limited-size datasets (MCAD, IXMAS, JHMDB, and NW-UCLA) at lower computational cost than competing methods. Key contributions are the BYOL-based skeleton representation learning, the joint modeling of spatial and temporal features, and the demonstrated efficiency gains, which make the model a candidate for deployment on resource-limited devices. Overall, the approach offers a practical pathway for unsupervised skeleton action recognition that remains competitive with state-of-the-art methods.
Abstract
Extracting robust and generalizable features for skeleton action recognition typically requires large amounts of well-curated data, whose collection is hindered by annotation and computation costs. Unsupervised representation learning is therefore of prime importance for leveraging unlabeled skeleton data. In this work, we investigate unsupervised representation learning for skeleton action recognition. For this purpose, we design a lightweight convolutional transformer framework, named ReL-SAR, which exploits the complementarity of convolutional and attention layers to jointly model spatial and temporal cues in skeleton sequences. We also use a Selection-Permutation strategy for skeleton joints to obtain more informative descriptions of skeletal data. Finally, we capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data. We achieve very competitive results on four limited-size datasets (MCAD, IXMAS, JHMDB, and NW-UCLA), showing the effectiveness of the proposed method against state-of-the-art methods in terms of both performance and computational efficiency. To ensure reproducibility and reusability, the source code, including all implementation parameters, is provided at: https://github.com/SafwenNaimi/Representation-Learning-for-Skeleton-Action-Recognition-with-Convolutional-Transformers-and-BYOL
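For readers unfamiliar with BYOL, the two ingredients of the generic objective can be sketched as follows. This is a minimal NumPy illustration of the standard BYOL formulation, not the ReL-SAR implementation: the function names byol_loss and ema_update, the default decay tau=0.99, and the array shapes are assumptions made for this example, and the encoder, projector, and skeleton augmentations are omitted.

```python
import numpy as np

def byol_loss(online_pred, target_proj):
    """Generic BYOL regression loss: 2 - 2 * cosine similarity, batch-averaged.

    online_pred: (batch, dim) output of the online predictor for one view.
    target_proj: (batch, dim) output of the target projector for the other
                 view; in training it is treated as a constant (stop-gradient).
    """
    p = online_pred / np.linalg.norm(online_pred, axis=1, keepdims=True)
    z = target_proj / np.linalg.norm(target_proj, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=1)))

def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter toward its online counterpart.

    tau is the exponential-moving-average decay (hypothetical default here).
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

The loss is 0 when the two embeddings point in the same direction and 4 when they point in opposite directions. In the usual BYOL recipe the loss is symmetrized by swapping the two augmented views, and only the online network receives gradients while the target network follows it through the moving average.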
