Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment
Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol, Andrew Zisserman
TL;DR
The paper introduces a unified sign-language understanding model that performs sign language translation (SLT) and sign-subtitle alignment (SSA). It uses a lightweight visual backbone (pose keypoints plus lip regions) and a Sliding Perceiver that bridges visual features to a large language model (LLM). Multilingual pretraining on BSL and ASL data yields state-of-the-art results on the BOBSL dataset for both SLT and SSA, with strong zero-shot performance on How2Sign and FLEURS-ASL. The approach emphasizes end-to-end trainability, privacy-preserving representations, and cross-language generalization, aided by a three-stage curriculum and DoRA adapters. Empirically, it outperforms larger Video-Swin-based isolated sign language recognition (ISLR) systems on both translation and alignment while remaining more efficient, which points to practical potential for scalable sign-language translation and education.
Abstract
Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a scalable multi-task training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and strong finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.
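
To make component (ii) concrete, the following is a minimal sketch of the general idea of pooling consecutive frame features into a shorter sequence of embeddings with sliding-window cross-attention. It is not the authors' implementation: the window length, the stride, the use of a single latent query, and the absence of learned projections are all assumptions made for illustration, and the names `cross_attend` and `sliding_pool` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, window):
    """Single-head cross-attention: one latent query pools one window of frames.

    Assumption: no learned key/value projections, so the output is simply an
    attention-weighted average of the frame features in the window.
    """
    dim = len(query)
    scores = [
        sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
        for frame in window
    ]
    weights = softmax(scores)
    return [
        sum(w * frame[d] for w, frame in zip(weights, window))
        for d in range(dim)
    ]


def sliding_pool(frames, query, window=8, stride=4):
    """Slide a fixed-size window over frame features; emit one embedding per window.

    `window` and `stride` are illustrative values, not taken from the paper.
    """
    if not frames:
        return []
    last_start = max(len(frames) - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra window so the tail frames are covered
    return [cross_attend(query, frames[s:s + window]) for s in starts]


# Usage: 20 frames of 4-dimensional features become 4 pooled embeddings.
random.seed(0)
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(20)]
query = [random.gauss(0, 1) for _ in range(4)]
tokens = sliding_pool(frames, query)
print(len(tokens), len(tokens[0]))
```

In this sketch the pooled sequence is shorter than the frame sequence by roughly a factor of the stride, which is the sense in which such a module can bring a dense visual stream closer to the granularity of text tokens before it is passed to a language model.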
