
SignRep: Enhancing Self-Supervised Sign Representations

Ryan Wong, Necati Cihan Camgoz, Richard Bowden

TL;DR

This work presents SignRep, a scalable self-supervised framework for sign language representation learning that relies on masked autoencoding guided by sign priors derived from pose information. By removing the need for skeletal keypoints at inference and using a single RGB modality, SignRep achieves state-of-the-art recognition on WLASL, ASL-Citizen, and NMFs-CSL, while also excelling at sign dictionary retrieval and serving effectively as a feature extractor for sign translation. The approach combines priors reconstruction, variance-covariance regularization, and an adversarial style loss to produce robust, sign-centric representations that generalize to unseen data. Its practical impact lies in enabling efficient, scalable sign-language modeling with reduced computational costs and without heavy multimodal architectures.

Abstract

Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. Existing methods often rely either on models pre-trained on general visual tasks, which lack sign-specific features, or on complex multimodal and multi-branch architectures. To bridge this gap, we introduce a scalable, self-supervised framework for sign representation learning. We incorporate important inductive (sign) priors into the training of our RGB model by using simple but informative skeleton-based cues while pretraining a masked autoencoder. These sign-specific priors, alongside feature regularization and an adversarial style-agnostic loss, provide a powerful backbone. Notably, our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models in downstream tasks. When finetuned, we achieve state-of-the-art performance for sign recognition on the WLASL, ASL-Citizen and NMFs-CSL datasets, using a simpler architecture and only a single modality. Beyond recognition, our frozen model excels in sign dictionary retrieval and sign translation, surpassing standard MAE pretraining and skeletal-based representations in retrieval. It also reduces computational costs for training existing sign translation models while maintaining strong performance on Phoenix2014T, CSL-Daily and How2Sign.
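To make the "feature regularization" term more concrete, the following is a minimal sketch of a variance and covariance penalty on a batch of representations, written in the style of VICReg. It is an illustration only: the function names, the target standard deviation `gamma`, and the stabilizer `eps` are assumptions, and the exact formulation used by SignRep may differ.

```python
import numpy as np

def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge penalty on feature dimensions whose batch std falls below gamma.

    z: array of shape (batch, dim). gamma and eps are assumed values.
    """
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_loss(z):
    """Sum of squared off-diagonal covariance entries, scaled by dim.

    Pushes different feature dimensions towards being decorrelated.
    """
    n, d = z.shape
    zc = z - z.mean(axis=0)                   # center each dimension
    cov = zc.T @ zc / (n - 1)                 # (dim, dim) covariance matrix
    off_diag = cov - np.diag(np.diag(cov))    # zero out the diagonal
    return float((off_diag ** 2).sum() / d)

# A collapsed batch (every row identical) is penalized by the variance term.
collapsed = np.ones((8, 4))
print(variance_loss(collapsed))

# Two perfectly correlated dimensions are penalized by the covariance term.
x = np.arange(8, dtype=float)
print(covariance_loss(np.stack([x, x], axis=1)))
```

The variance term discourages all representations from collapsing to the same vector, while the covariance term discourages redundant dimensions. In a training loop these would be weighted and added to the prior reconstruction and adversarial style losses.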

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (Left): The pretraining process for SignRep, which leverages masked representation learning to predict sign priors such as hand keypoints and joint angles. This is achieved through a Hiera encoder and a lightweight sign decoder. The representation is further refined with regularization losses, including variance, covariance and adversarial style loss. (Right): An example setup for the discriminator to obtain a representation pair to predict a style-representation match.
  • Figure 2: Visualization of the extracted 3D keypoints. Numbers alongside the nodes represent the keypoint indices. For visualization purposes, we separate the left and right hands from the body.
  • Figure 3: Qualitative results for ASL-Citizen for retrieval based on features extracted from the pretrained SignRep. Given the reference sequence (Ref.), the Top-3 most similar videos are retrieved based on the cosine similarity of the output representations. M1 denotes the closest match, M2 is the second closest match and M3 is the third closest match.
  • Figure 4: Qualitative results for NMFs-CSL for retrieval based on features extracted from the pretrained SignRep. Given the reference sequence (Ref.), the Top-3 most similar videos are retrieved based on the cosine similarity of the output representations. M1 denotes the closest match, M2 is the second closest match and M3 is the third closest match.
  • Figure 5: Qualitative results for WLASL for retrieval based on features extracted from the pretrained SignRep. Given the reference sequence (Ref.), the Top-3 most similar videos are retrieved based on the cosine similarity of the output representations. M1 denotes the closest match, M2 is the second closest match and M3 is the third closest match.
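The retrieval procedure described in Figures 3 to 5 (ranking gallery videos by cosine similarity to a reference representation and keeping the Top-3) can be sketched as follows. This is a hedged illustration: `top_k_matches` and the toy 2-D vectors are hypothetical, and in practice the features would come from the frozen pretrained encoder. The sketch assumes all feature vectors are non-zero.

```python
import numpy as np

def top_k_matches(query, gallery, k=3):
    """Return indices and cosine similarities of the k closest gallery items.

    query:   array of shape (dim,), the reference representation.
    gallery: array of shape (num_videos, dim), candidate representations.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity per gallery item
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order.tolist(), sims[order].tolist()

# Toy example with hand-made 2-D "features" (not real model outputs).
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
reference = np.array([1.0, 0.0])
indices, scores = top_k_matches(reference, gallery, k=3)
print(indices)  # [0, 1, 2], i.e. M1, M2, M3 in the figures' notation
```

Because the vectors are L2-normalized first, the dot product equals the cosine similarity, so the ranking is independent of feature magnitude.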