SSL-SLR: Self-Supervised Representation Learning for Sign Language Recognition

Ariel Basso Madjoukeng, Jérôme Fink, Pierre Poitier, Edith Belise Kenmogne, Benoit Frenay

TL;DR

A self-supervised learning framework that learns meaningful representations for sign language recognition and shows a considerable gain in accuracy over several contrastive and self-supervised methods in linear evaluation, semi-supervised learning, and transfer between sign languages.

Abstract

Sign language recognition (SLR) is a machine learning task that aims to identify signs in videos. Because annotated data is scarce, unsupervised methods such as contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer together and pushing negative pairs (views of different instances) apart. In a sign video, however, only certain parts carry information that is truly useful for recognition. Applying contrastive methods to SLR therefore raises two issues: (i) they treat all parts of a video in the same way, without accounting for the greater relevance of some parts; (ii) movements shared between different signs make negative pairs highly similar, which complicates sign discrimination. These issues lead to non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. The framework consists of two key components designed to work together: (i) a new self-supervised approach that does not rely on negative pairs ("free-negative pairs"); (ii) a new data augmentation technique. The approach shows a considerable gain in accuracy over several contrastive and self-supervised methods in linear evaluation, semi-supervised learning, and transfer between sign languages.
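
For reference, the pull/push behaviour described in the abstract can be written down with one widely used contrastive objective, the normalized temperature-scaled cross-entropy (NT-Xent) loss. This is only an illustration of the general idea, not necessarily the formulation of the baselines compared in the paper. The symbols are assumptions introduced here: $z_i$ and $z_j$ are the embeddings of two augmented views of the same sign video, $\mathrm{sim}$ is cosine similarity, $\tau$ is a temperature, and the batch holds $2N$ views.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1,\, k \neq i}^{2N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
```

Under this reading, a negative $z_k$ that shares movements with the anchor contributes a large term to the denominator, so the loss pushes apart two signs that genuinely overlap. This is one way to understand issue (ii) above.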

Paper Structure

This paper contains 16 sections, 2 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Examples of sign steps across three different datasets: LSA [ronchetti2023lsa64] (first row), ASL [desai2023asl] (second row), and LSFB [fink2021lsfb] (third row). For privacy reasons, signers' faces have been blurred.
  • Figure 2: SL-FPN architecture: A sign and its augmented variants are passed through an encoder. SL-FPN optimizes three objectives: (1) minimizing the distance between representations of the two augmented variants; (2) minimizing the distance between representations of one augmented variant and the original instance; and (3) minimizing the distance between representations of the original sample and the other augmented variant using a predictor with a stop-gradient operator.
  • Figure 3: Accuracy under the linear evaluation protocol on the LSFB and GSL datasets as a function of the number $k$ of shuffled frames, counted from the beginning of the video (left to right) and from the end (right to left).
  • Figure 4: Histogram of class proportions as a function of permutations starting from the first frames (a) and from the last frames (b).
  • Figure 5: Impact of the global choice ($k_s^*$ and $k_e^*$) on the failure cases.
  • ...and 2 more figures
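
The captions of Figures 3 to 5 refer to shuffling $k$ frames at the beginning or at the end of a sign video. The sketch below shows one possible reading of that operation: permute the first `k_start` and the last `k_end` frames and leave the middle of the clip untouched. The function name, its parameters, and the decision to shuffle each boundary segment independently are assumptions made for illustration. The excerpt does not specify the paper's actual augmentation, so this should not be taken as the authors' implementation.

```python
import random
from typing import List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def shuffle_boundary_frames(
    frames: Sequence[Frame],
    k_start: int = 0,
    k_end: int = 0,
    rng: Optional[random.Random] = None,
) -> List[Frame]:
    """Return a copy of `frames` with the first `k_start` and the last
    `k_end` frames randomly permuted (hypothetical helper, see lead-in)."""
    n = len(frames)
    if k_start < 0 or k_end < 0 or k_start + k_end > n:
        raise ValueError(
            "k_start and k_end must be non-negative and sum to at most len(frames)"
        )
    rng = rng or random.Random()

    # Split the clip into three contiguous segments.
    head = list(frames[:k_start])
    middle = list(frames[k_start:n - k_end])
    tail = list(frames[n - k_end:])

    # Permute only the boundary segments; the middle keeps its order.
    rng.shuffle(head)
    rng.shuffle(tail)
    return head + middle + tail


# Usage: frame indices stand in for real video frames.
video = list(range(10))
augmented = shuffle_boundary_frames(video, k_start=3, k_end=2, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible. With `k_start=0` and `k_end=0` the function returns the frames in their original order.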