Deep self-supervised learning with visualisation for automatic gesture recognition

Fabien Allemand, Alessio Mazzela, Jun Villette, Decky Aspandi, Titus Zaharia

TL;DR

This work investigates automatic gesture recognition from 3D skeleton data by comparing supervised learning across FC, CNN, and LSTM architectures, with an emphasis on self-supervised pretraining via reconstruction on unlabelled sequences. It demonstrates that CNN and LSTM achieve near-perfect accuracy on a binary Mono/Bi gesture task, while self-supervised pretraining can boost performance when labelled data are limited. Grad-CAM analyses reveal that models focus on relevant moving joints for single-hand gestures but show interpretability gaps for two-handed gestures, guiding future refinements. Overall, the study highlights the potential of self-supervised strategies to utilize unlabelled data and the value of visualization techniques for understanding gesture recognition models, with implications for sign language, VR, and human–computer interaction systems.

Abstract

Gesture is an important means of non-verbal communication: its visual modality allows humans to convey information during interaction, facilitating both human-human and human-machine interaction. However, automatically recognising gestures is considered difficult. In this work, we explore three different means of recognising hand signs using deep learning: supervised learning based methods, self-supervised methods and visualisation based techniques applied to 3D moving skeleton data. Supervised learning is used to train fully connected, CNN and LSTM models. Then, a reconstruction method is applied to unlabelled data in simulated settings using a CNN as a backbone, and the learnt features are used to perform prediction on the remaining labelled data. Lastly, Grad-CAM is applied to discover the focus of the models. Our experimental results show that the supervised learning methods recognise gestures accurately, with self-supervised learning increasing the accuracy in simulated settings. Finally, Grad-CAM visualisation shows that the models indeed focus on the skeleton joints relevant to the associated gesture.
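
To make the Grad-CAM step more concrete, the sketch below shows the standard Grad-CAM combination rule on toy data: each feature map is weighted by the mean of its gradient, the weighted maps are summed, and a ReLU keeps only positive evidence. It is an illustration, not the authors' implementation. The function name `grad_cam`, the frames-by-joints layout of the maps, and the final normalisation to [0, 1] are assumptions made here; the abstract does not specify which layer or layout the paper uses.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations, gradients: lists of K maps, each a grid of
    n_rows x n_cols floats (assumed here: frames x joints).
    gradients[k] holds d(class score)/d(activations[k]).
    """
    n_rows = len(activations[0])
    n_cols = len(activations[0][0])
    cam = [[0.0] * n_cols for _ in range(n_rows)]

    for fmap, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over the map.
        alpha = sum(sum(row) for row in grad) / (n_rows * n_cols)
        for i in range(n_rows):
            for j in range(n_cols):
                cam[i][j] += alpha * fmap[i][j]

    # ReLU: keep only locations that support the class score.
    cam = [[max(0.0, v) for v in row] for row in cam]

    # Optional normalisation to [0, 1] for display (an assumption here).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: 2 feature maps over 2 frames x 2 joints (hypothetical values).
toy_activations = [[[1.0, 0.0], [0.0, 2.0]],
                   [[0.0, 3.0], [1.0, 0.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]],
                 [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In a real model the activations and gradients would come from a convolutional layer of the trained network; high values in the heatmap mark the frame and joint positions that contributed most to the predicted class.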

Paper Structure (15 sections, 14 figures, 3 tables)

This paper contains 15 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Several applications of gesture recognition.
  • Figure 2: Deep learning models used for sign language recognition.
  • Figure 3: Supervised learning process for class prediction.
  • Figure 4: Unsupervised learning process for input reconstruction.
  • Figure 5: Overview of self-supervised learning for class prediction.
  • ...and 9 more figures