Representing Signs as Signs: One-Shot ISLR to Facilitate Functional Sign Language Technologies

Toon Vandendriessche, Mathieu De Coster, Annelies Lejon, Joni Dambre

TL;DR

The paper tackles the scalability challenge of Isolated Sign Language Recognition (ISLR) across languages and evolving vocabularies by learning language-independent sign embeddings through pretraining and performing one-shot recognition via dense vector search. Using PoseFormer with keypoint inputs, it achieves state-of-the-art performance on ASL_Citizen and demonstrates strong cross-language generalization to large dictionaries (e.g., 10,235 signs) with a one-shot MRR of $0.508$. These findings show that sign representations, rather than translations, enable robust, scalable ISLR, even as vocabularies grow and signing contexts vary. The work was co-created with the Deaf and Hard of Hearing (DHH) community and culminated in a publicly available dictionary lookup tool, highlighting practical impact for DHH users.

Abstract

Isolated Sign Language Recognition (ISLR) is crucial for scalable sign language technology, yet language-specific approaches limit current models. To address this, we propose a one-shot learning approach that generalises across languages and evolving vocabularies. Our method involves pretraining a model to embed signs based on essential features and using a dense vector search for rapid, accurate recognition of unseen signs. We achieve state-of-the-art results, including 50.8% one-shot MRR on a large dictionary containing 10,235 unique signs from a different language than the training set. Our approach is robust across languages and support sets, offering a scalable, adaptable solution for ISLR. Co-created with the Deaf and Hard of Hearing (DHH) community, this method aligns with real-world needs and advances scalable sign language recognition.
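
As a rough illustration of how a dense vector search and the MRR metric fit together, the sketch below ranks support-set embeddings by similarity to each query and averages the reciprocal rank of the correct sign. It is not the authors' implementation: the toy 2-D embeddings, the gloss labels, the function names (`cosine`, `rank_labels`, `mean_reciprocal_rank`), and the use of cosine similarity are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors (similarity choice is an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_labels(query, support):
    """Return support labels sorted from most to least similar to the query."""
    return sorted(support, key=lambda label: cosine(query, support[label]), reverse=True)

def mean_reciprocal_rank(queries, support):
    """Average of 1 / rank of the true label over all queries."""
    total = 0.0
    for embedding, true_label in queries:
        ranking = rank_labels(embedding, support)
        total += 1.0 / (ranking.index(true_label) + 1)
    return total / len(queries)

# Hypothetical one-shot support set: one embedding per dictionary sign.
support = {"HOUSE": [1.0, 0.0], "TREE": [0.0, 1.0], "CAR": [-1.0, 0.0]}

# Hypothetical query embeddings paired with their true labels.
queries = [([0.9, 0.1], "HOUSE"), ([0.6, 0.8], "HOUSE")]

mrr = mean_reciprocal_rank(queries, support)
print(mrr)  # first query ranks HOUSE 1st, second ranks it 2nd: (1 + 0.5) / 2 = 0.75
```

In this reading, an MRR of 0.508 would mean that the correct sign sits, on average, at a reciprocal rank of about one half in the ranked dictionary results.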

Paper Structure

This paper contains 19 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We perform one-shot sign classification to search through a dictionary with a query video. Solid arrows: a sign language dictionary is mapped to a support set of embeddings by an SLR model. This is done once. Dashed arrows: we can classify a new example (query the dictionary) by also mapping the example to an embedding with the same model and using attention to obtain probabilities for every label in the support set. This can be done without regenerating the support set.
  • Figure 2: Class distributions of pretraining datasets, sorted by descending sample count.
  • Figure 3: The number of dictionary queries per gloss is distributed approximately uniformly with mean 11.93. Each label on the horizontal axis is a link to the corresponding video within the Flemish Sign Language dictionary.
  • Figure 4: The PoseFormer model, represented by solid lines, consists of several blocks. The 1D convolutions process data along the temporal axis, while the frame embedding block handles individual frames. Finally, the multi-head attention block extracts relevant features. After training, the classification head, consisting of a single linear layer (depicted with a dashed line), is removed.
  • Figure 5: Metrics for one-shot evaluation.
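
The one-shot classification flow described in the Figure 1 caption can be sketched as follows, under stated assumptions. The caption says only that attention yields probabilities for every label in the support set; the plain dot-product scores with a softmax used here, the toy embeddings, and the name `attention_probabilities` are assumptions for illustration rather than the paper's exact formulation.

```python
import math

def attention_probabilities(query, support_embeddings, support_labels):
    """Softmax over query-support dot products, summed per label (assumed attention form)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in support_embeddings]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    probs = {}
    for label, weight in zip(support_labels, weights):
        probs[label] = probs.get(label, 0.0) + weight / total
    return probs

# Built once: the dictionary mapped to embeddings by the (hypothetical) SLR model.
support_embeddings = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
support_labels = ["HOUSE", "TREE", "CAR"]

# Per query: embed the new video with the same model, then attend over the support set.
query = [1.0, 0.2]
probs = attention_probabilities(query, support_embeddings, support_labels)
predicted = max(probs, key=probs.get)

# Adding a new sign appends one embedding; existing support entries are untouched.
support_embeddings.append([0.0, -2.0])
support_labels.append("BOOK")
probs_extended = attention_probabilities(query, support_embeddings, support_labels)
```

Because the support set is just a list of embeddings and labels, extending the vocabulary in this sketch requires no retraining and no regeneration of the existing entries, which matches the behaviour the caption describes.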