Table of Contents
Fetching ...

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

Youngjoon Jang, Haran Raajesh, Liliane Momeni, Gül Varol, Andrew Zisserman

TL;DR

This paper tackles open-vocabulary sign language translation by integrating multiple contextual cues—background scene descriptions, previous sentence translations, and sign-level pseudo-glosses—with visual signing features in a pre-trained large language model. The authors propose a mapping network from Video-Swin features to LLM tokens and fine-tune Llama3-8B using LoRA, augmented with training-time stimuli like word and cue dropouts. Through extensive ablations on the BOBSL dataset and evaluation on How2Sign, the method consistently outperforms baselines and state-of-the-art approaches, demonstrating the value of context for disambiguation, referent resolution, and tense handling. The work highlights practical gains in SLT performance and generalizes across datasets, while also acknowledging limitations such as potential noise from background context and error propagation from previous translations, pointing to future improvements in robustness and reliability.

Abstract

Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

TL;DR

This paper tackles open-vocabulary sign language translation by integrating multiple contextual cues—background scene descriptions, previous sentence translations, and sign-level pseudo-glosses—with visual signing features in a pre-trained large language model. The authors propose a mapping network from Video-Swin features to LLM tokens and fine-tune Llama3-8B using LoRA, augmented with training-time stimuli like word and cue dropouts. Through extensive ablations on the BOBSL dataset and evaluation on How2Sign, the method consistently outperforms baselines and state-of-the-art approaches, demonstrating the value of context for disambiguation, referent resolution, and tense handling. The work highlights practical gains in SLT performance and generalizes across datasets, while also acknowledging limitations such as potential noise from background context and error propagation from previous translations, pointing to future improvements in robustness and reliability.

Abstract

Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
Paper Structure (30 sections, 10 figures, 16 tables)

This paper contains 30 sections, 10 figures, 16 tables.

Figures (10)

  • Figure 1: Contextual cues in SLT: In addition to information extracted from the signing content (at the bottom right corner), we give the sign language translation model two contextual cues: the background description that identifies keywords describing the scene behind the signer, and the previous sentence translations. In this example, the ground truth (GT) translation has common words or semantics with the background context (e.g., flower), and the previous sentence (e.g., wind).
  • Figure 2: Method overview: The input prompt combines contextual cues, the background descriptions and previous sentences, with the information from the current video sequence, specifically visual features and pseudo-glosses. Visual features corresponding to the signer $\{V\}$ are extracted using a pre-trained Video-Swin model, which are projected to text space with a learnable mapping network. We obtain pseudo-glosses $\{P\}$ by passing the Video-Swin features through the pre-trained ISLR Classifier (*Video-Swin here denotes the layers except the last one of the ISLR model). The background captions, obtained from an off-the-shelf image captioner, are summarised into a list of keywords, which we refer to as background descriptions. During training, we randomly sample previous GT sentences and previous predictions, while during inference, the model uses its previous prediction in an auto-regressive manner. In practice, we include prompts that instruct the model on sign language translation and describe each input. We supervise the predicted translation output by comparing it against the ground truth, e.g., 'As pagans, the Romans worshipped many gods and spirits' in this example. Note that we do not use ground-truth glosses -- they are displayed on the bottom right (e.g. roman, many...) only for illustration.
  • Figure 3: Qualitative analysis: We present visual examples to show how different cues affect the translation results. Starting with visual features, we incrementally add pseudo-glosses (PG), the predicted previous sentence (Prev), and the background description (BG). We observe that the previous sentence helps translation performance by providing further context (top left, bottom right). The background description also helps for pronoun referencing (top right), place names (middle left), pointing gestures (middle right), and object referencing (bottom left). However, the background can also in some cases hinder translation (bottom right). We refer to \ref{['subsec:qualitative']} for detailed comments.
  • Figure A.1: LLM evaluation prompt: We provide the input format that we feed to GPT-4o-minigpt4 to evaluate the quality of the translated sentence (text_pred) by asking the LLM to compare it against the ground truth sentence (text_gt). Specifically, we design a system prompt to define the task, and a series of user-assistant prompt pairs to provide input-output examples for calibration. The last user prompt includes the translated sentence to be evaluated. Instructions are repeated at each user prompt. Here, we display only one example (enclosed in between # comment lines to facilitate the reading). In practice, we provide 12 in-context examples, which are listed in \ref{['tab:app:incontext']}, and the full prompt can be found in the code release.
  • Figure A.2: Correlation between human judgement and evaluation metrics: We plot the scores obtained via human evaluation ('Manual') against the LLM evaluation scores and the standard captioning metrics (BLEU, ROUGE, and BLEURT). We observe that the LLM score correlates the most with human judgement.
  • ...and 5 more figures