A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision

Charles Raude; K R Prajwal; Liliane Momeni; Hannah Bull; Samuel Albanie; Andrew Zisserman; Gül Varol

TL;DR

This work tackles large-vocabulary continuous sign language recognition (CSLR) and sign-language sentence retrieval by introducing CSLR2, a multi-task Transformer that embeds signed and spoken languages into a shared space. It leverages weak and noisy supervision from large datasets and a new sign-level annotated CSLR-Test benchmark to enable dense time-aligned predictions and retrieval in tandem. Joint training with sentence- and sign-level objectives yields mutual improvements for both tasks, surpassing prior state-of-the-art baselines on CSLR and retrieval benchmarks. The approach establishes a scalable framework for sign-language understanding with practical impact on accessibility and searchability of signing content, while outlining directions for expanding vocabulary and non-lexical signing modeling.

Abstract

In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint space shared by signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance: retrieval improves CSLR performance by providing context, while CSLR improves retrieval through more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
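To make the idea of jointly training on sentence-level and sign-level objectives more concrete, the following is a minimal sketch of a two-term contrastive objective. It is an illustration under stated assumptions, not the paper's implementation: the symmetric InfoNCE form, the `temperature` value, the `sign_weight` parameter, and all function names are hypothetical choices, since this page only states that two contrastive losses are used.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where video_embs[i] pairs with text_embs[i].

    Assumption: a generic contrastive loss, not necessarily the exact
    formulation used by the authors.
    """
    vids = [normalize(v) for v in video_embs]
    txts = [normalize(t) for t in text_embs]
    n = len(vids)
    # logits[i][j]: scaled cosine similarity between video i and text j
    logits = [[dot(v, t) / temperature for t in txts] for v in vids]

    def cross_entropy(rows):
        # The matching pair for row i sits on the diagonal (column i).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    video_to_text = cross_entropy(logits)
    text_to_video = cross_entropy(
        [[logits[i][j] for i in range(n)] for j in range(n)]
    )
    return 0.5 * (video_to_text + text_to_video)


def joint_loss(sent_v, sent_t, sign_v, sign_t, sign_weight=1.0):
    """Sentence-level term plus a weighted sign-level term (weight is hypothetical)."""
    return info_nce(sent_v, sent_t) + sign_weight * info_nce(sign_v, sign_t)
```

In this sketch, correctly paired embeddings yield a loss near zero, while mismatched pairs are penalised heavily, which is the behaviour both the sentence-level and sign-level terms rely on.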

Paper Structure (32 sections, 3 equations, 12 figures, 16 tables)

Introduction
Related Work
Joint Space for Signed and Spoken Languages
Model overview and inference
Training with sentence- and sign-level losses
Sources of supervision
Implementation details
A New CSLR Evaluation Benchmark
Experiments
Data and evaluation protocol
Baselines
Ablation study
Comparison to the state of the art
Qualitative analysis
Conclusion
...and 17 more sections

Figures (12)

Figure 1: CSLR$^{2}$ model: We illustrate our multi-task model that performs both CSLR and sentence Retrieval, thanks to its joint embedding space between signed language and spoken language text.
Figure 2: Method overview: (a) We show a simplified view of our model architecture, which consists of both video and text streams. On the video side, features are extracted from a signing video clip $V$ by running $\mathcal{V}_{enc}^\text{Sign}$ in a sliding window fashion and passed through a Transformer model $\mathcal{V}_{enc}^\text{Sent}$. A video embedding $\mathbf{V}$ and sign video embeddings $\{\mathbf{v}_f\}$ are subsequently extracted. On the text side, we input an English subtitle sentence $T$ and sign pseudo-labels $\{t_w\}$ to the text encoder $\mathcal{T}_{enc}$ and obtain sentence and sign text embeddings ($\mathbf{T}$, $\{\mathbf{t}_w\}$), respectively. While we illustrate only one triplet data point $(V, T, \{t_w\})$, in practice, we operate on a minibatch of triplets, and employ two contrastive losses to jointly train on sentence retrieval $\mathcal{L}_{\text{SentRet}}$ and sign retrieval $\mathcal{L}_{\text{SignRet}}$. (b) For text-to-video retrieval inference, we simply extract a sentence text embedding given a text query, and rank the sentence video embeddings corresponding to gallery videos according to their cosine similarities. (c) For CSLR inference, each sign video embedding is matched to the top-ranked word from a large vocabulary of size 8K. A post-processing strategy is applied on frame-level predictions to produce final outputs. For visibility, we omit linear layers which project embeddings into the learnt joint space. See the "Model overview and inference" section for a detailed description of the architecture and inference procedure.
Figure 3: Annotation examples from the CSLR-Test dataset: As well as assigning the English word(s) corresponding to a sign (i.e. 'gloss'), the annotators indicate the type of sign when appropriate. For example, '*P' for pointing, '*FS' for fingerspelling, '*G' for gesture sign.
Figure 4: Qualitative CSLR results: We compare our model's predictions (Pred) against the ground truth (GT), providing examples from several error ranges (sorted by WER). The subtitles displayed below each example are not used by the model. While we observe that our model correctly predicts a large portion of signs, handling both English synonyms as well as sign language polysemy (two visually similar signs with different meanings) makes the CSLR task challenging. Synonyms are depicted with the same color coding, e.g. 'earth' and 'world' in 3rd row, middle.
Figure A.1: Final sign annotations
...and 7 more figures
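The two inference modes described in the Figure 2 caption, (b) ranking gallery videos by cosine similarity to a text query and (c) matching each sign video embedding to its top-ranked vocabulary word, can be sketched as follows. This is a minimal illustration with toy vectors, not the authors' code: all function names are hypothetical, and `collapse_repeats` is only a simple stand-in for the post-processing strategy, whose details are not given on this page.

```python
import math
from itertools import groupby


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query_text_emb, gallery_video_embs):
    """Text-to-video retrieval: gallery indices sorted by decreasing similarity."""
    sims = [cosine(query_text_emb, v) for v in gallery_video_embs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


def predict_signs(sign_video_embs, vocab_words, vocab_text_embs):
    """CSLR: label each sign video embedding with its top-ranked vocabulary word."""
    preds = []
    for v in sign_video_embs:
        sims = [cosine(v, t) for t in vocab_text_embs]
        best = max(range(len(sims)), key=lambda i: sims[i])
        preds.append(vocab_words[best])
    return preds


def collapse_repeats(frame_preds):
    """Hypothetical post-processing: merge consecutive identical predictions."""
    return [word for word, _ in groupby(frame_preds)]


# Toy example: a 3-word vocabulary instead of the 8K words used in the paper.
vocab_words = ["earth", "sun", "water"]
vocab_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sign_video_embs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.0, 0.9]]

frame_preds = predict_signs(sign_video_embs, vocab_words, vocab_text_embs)
sign_sequence = collapse_repeats(frame_preds)

gallery = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_gallery([1.0, 0.0], gallery)
```

Because both tasks reduce to nearest-neighbour lookups in the same embedding space, a single trained model can serve retrieval and recognition without task-specific output heads in this sketch.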
