Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
Youngjoon Jang, Haran Raajesh, Liliane Momeni, Gül Varol, Andrew Zisserman
TL;DR
This paper tackles open-vocabulary sign language translation by integrating multiple contextual cues (background scene descriptions, previous sentence translations, and sign-level pseudo-glosses) with visual signing features in a pre-trained large language model. The authors propose a mapping network from Video-Swin features to LLM tokens and fine-tune Llama3-8B using LoRA, with training-time augmentations such as word dropout and cue dropout. In extensive ablations on the BOBSL dataset, the method outperforms baselines and previously reported results, and it achieves competitive results when applied to How2Sign, demonstrating the value of context for disambiguation, referent resolution, and tense handling. The authors also acknowledge limitations, such as potential noise from background context and error propagation from previous translations, which point to future work on robustness and reliability.
Abstract
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues, together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL, the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
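To make the role of the three textual cues concrete, the following is a minimal sketch of how they might be assembled into a text prompt for the LLM, with the word and cue dropout mentioned in the TL;DR applied at training time. Everything here is an illustrative assumption rather than the authors' implementation: the function names (`build_prompt`, `drop_words`), the prompt labels, and the dropout probabilities are hypothetical, and the visual features (which would be mapped to LLM token embeddings separately) are not modelled.

```python
import random
from typing import List, Optional


def drop_words(text: str, p: float, rng: random.Random) -> str:
    """Remove each whitespace-separated word independently with probability p."""
    kept = [w for w in text.split() if rng.random() >= p]
    return " ".join(kept)


def build_prompt(
    background: Optional[str],
    previous: Optional[str],
    pseudo_glosses: Optional[List[str]],
    *,
    cue_drop_p: float = 0.0,   # hypothetical: chance of omitting a whole cue
    word_drop_p: float = 0.0,  # hypothetical: chance of omitting each word
    rng: Optional[random.Random] = None,
) -> str:
    """Concatenate the available textual cues into a single prompt string.

    Labels such as "Background:" are assumptions made for illustration only.
    """
    rng = rng or random.Random()
    cues = {
        "Background": background,
        "Previous sentence": previous,
        "Pseudo-glosses": " ".join(pseudo_glosses) if pseudo_glosses else None,
    }
    lines = []
    for name, text in cues.items():
        if not text:
            continue  # cue not available for this sample
        if rng.random() < cue_drop_p:
            continue  # cue dropout: skip this cue entirely
        text = drop_words(text, word_drop_p, rng)
        if text:
            lines.append(f"{name}: {text}")
    lines.append("Translation:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Inference-style call: no dropout, all cues present.
    print(build_prompt("A chef cooks.", "He chops onions.", ["ADD", "SALT"]))
    # Training-style call: cues and words are randomly omitted.
    print(build_prompt("A chef cooks.", "He chops onions.", ["ADD", "SALT"],
                       cue_drop_p=0.5, word_drop_p=0.3, rng=random.Random(0)))
```

Randomly withholding cues and words during training is one plausible way to keep a model from over-relying on any single cue, for example on a previous-sentence translation that may itself contain errors.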
