Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment

Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol, Andrew Zisserman

TL;DR

The paper introduces a unified sign-language understanding model that performs SLT and SSA using a lightweight visual backbone (pose keypoints + lip regions) and a Sliding Perceiver to bridge visual features to an LLM. It leverages multilingual pretraining on BSL and ASL data to achieve state-of-the-art results on the BOBSL dataset for both SLT and SSA, with strong zero-shot performance on How2Sign and FLEURS-ASL. The approach emphasizes end-to-end trainability, privacy-preserving representations, and cross-language generalization, aided by a three-stage curriculum and DoRA adapters. Empirically, it delivers superior translation and alignment performance while maintaining efficiency compared with larger Video-Swin-based ISLR systems, highlighting practical potential for scalable sign-language translation and education.

Abstract

Our aim is to develop a unified model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles -- both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.

Paper Structure

This paper contains 32 sections, 8 equations, 10 figures, 18 tables.

Figures (10)

  • Figure 1: A unified sign language understanding model. Given signing data, our model performs both SLT and SSA, guided by textual prompts. For both tasks, a 500-frame (20s at 25 fps) video is used as input. In SLT mode, the model receives the sign video with frame-level timestamps specifying the region of interest (not shown for clarity), and generates a spoken language translation for that segment. In SSA mode, the model takes the sign video, a target sentence along with its audio-aligned timestamps (if available), and predicts the timestamps where the sentence is signed.
  • Figure 2: Method overview. Given a 20-second long sign language video, our model extracts two complementary modalities: (i) holistic body keypoints and (ii) grayscale lip-region sequences. Each modality is processed by its respective backbone using a sliding-window (window size = $24$, stride = $2$) approach to produce temporally dense pose and lip features, denoted as $f_p$ and $f_l$. These features are concatenated along the channel dimension and passed into the Sliding Perceiver, which aggregates local visual information into a compact latent sequence $L'$. The latent representation $L'$ is then inserted within a task-specific prompt (at <SignHere> token) and fed into a pretrained LLM to perform one of two downstream tasks. In SLT mode, the model generates spoken-language text corresponding to the signing video in the region of interest (denoted by frame indices at the {start} and {end} tokens in the prompt). In SSA mode, the model predicts the start and end timestamps, as frame indices, where the provided sentence (inserted at the {sentence} token in the prompt) is being signed. Note that the {lang} field in the SLT prompt specifies the target language using its ISO 639 code.
  • Figure 3: Visual backbone. The model has two streams: pose and lip backbones. The pose backbone processes $F$ frames of 3D keypoints, divided into face, body, and hand articulators, encoding each to produce articulator-specific features ($f_f$, $f_b$, $f_{h\text{-}L}$, $f_{h\text{-}R}$). Up to two of these features are randomly masked during training to improve robustness. The features are then concatenated along the channel dimension, and passed through a Conformer to obtain the pose embedding. The lip backbone also processes $F$ grayscale video frames of the signer lip-region via a frozen pretrained lip reader, followed by a Conformer that outputs the lip embedding.
  • Figure 4: Sliding Perceiver. The input sequence $X$ is linearly projected and combined with learnable temporal positional encodings. The resulting sequence $X'$ is then processed in a sliding-window manner. Each window is passed through $N$ stacked cross-attention and feed-forward layers with a learnable latent vector $L$.
  • Figure A.1: A detailed pose backbone illustration. (i) Main path: Starting from the 3D keypoints, each articulator stream processes its input through an initial fully connected layer ($\mathrm{fc}_{1}$) followed by a 4-layer AGCN. The resulting jointwise features are reshaped and aggregated using $\mathrm{fc}_{2}$ to obtain the articulator-specific representations ($f_f$, $f_b$, $f_{h\text{-}R}$, $f_{h\text{-}L}$). During training, we randomly zero-fill up to two of these four features to mask articulators, after which the features pass through the backbone Conformer and the classifier head to produce the pose logits $\hat{y}_{p}$. (ii) Auxiliary network: The articulator-specific features ($f_f$, $f_b$, $f_{h\text{-}R}$, $f_{h\text{-}L}$) are each processed by a shared-weight auxiliary Conformer, and the resulting features are subsequently passed to the same classifier heads employed in the pose backbone. The classifier produces articulator-specific logits ($\hat{y}_f$, $\hat{y}_b$, $\hat{y}_{h\text{-}R}$, $\hat{y}_{h\text{-}L}$), which are compared against the pseudo-gloss label $y$ using a cross-entropy objective, providing additional supervision during training.
  • ...and 5 more figures
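
The sliding-window processing described in the Figure 2 and Figure 4 captions can be illustrated with a small pure-Python sketch. Only the 500-frame input, the backbone window size of 24, and the backbone stride of 2 come from the captions. Everything else is an assumption made for illustration: windows are taken without padding, the Sliding Perceiver window and stride (`PERCEIVER_WINDOW`, `PERCEIVER_STRIDE`), the layer count, and the toy feature dimension are hypothetical, and the cross-attention uses a single head with identity key/value projections. The linear input projection, the learnable positional encodings, and the feed-forward layers mentioned in the Figure 4 caption are omitted, so this is a simplified stand-in rather than the authors' implementation.

```python
import math
import random

NUM_FRAMES = 500        # 20 s at 25 fps (Figure 1 caption)
BACKBONE_WINDOW = 24    # Figure 2 caption
BACKBONE_STRIDE = 2     # Figure 2 caption
PERCEIVER_WINDOW = 8    # hypothetical, not stated in the captions
PERCEIVER_STRIDE = 4    # hypothetical, not stated in the captions
NUM_LAYERS = 2          # hypothetical value for N in Figure 4
DIM = 6                 # toy feature dimension


def window_starts(length, window, stride):
    """Start indices of all full windows (no padding; an assumption)."""
    if length < window:
        return []
    return list(range(0, length - window + 1, stride))


def cross_attend(latent, window_feats):
    """Single-head cross-attention: the latent vector queries one window.

    Keys and values are the raw features (identity projections), a
    simplification; a trained layer would use learned projections.
    """
    scale = math.sqrt(len(latent))
    scores = [sum(q * k for q, k in zip(latent, feat)) / scale
              for feat in window_feats]
    peak = max(scores)                      # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    attended = [sum(w * feat[d] for w, feat in zip(weights, window_feats))
                for d in range(len(latent))]
    # Residual connection around the attention output.
    return [l + a for l, a in zip(latent, attended)]


def sliding_latent_pooling(features, latent, window, stride, num_layers):
    """Produce one latent vector per window of the feature sequence."""
    outputs = []
    for start in window_starts(len(features), window, stride):
        chunk = features[start:start + window]
        state = list(latent)                # same initial latent per window
        for _ in range(num_layers):
            state = cross_attend(state, chunk)
        outputs.append(state)
    return outputs


rng = random.Random(0)

# One backbone feature per 24-frame window, stepping 2 frames at a time.
backbone_starts = window_starts(NUM_FRAMES, BACKBONE_WINDOW, BACKBONE_STRIDE)

# Toy stand-in for the concatenated pose and lip features.
features = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)]
            for _ in backbone_starts]
latent = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

latent_sequence = sliding_latent_pooling(
    features, latent, PERCEIVER_WINDOW, PERCEIVER_STRIDE, NUM_LAYERS)

print(len(backbone_starts), len(latent_sequence))
```

Under the no-padding assumption, the 500-frame clip yields (500 - 24) / 2 + 1 = 239 backbone steps. The length of the resulting latent sequence depends entirely on the hypothetical Perceiver window and stride chosen here.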
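
The articulator masking mentioned in the Figure 3 and Figure A.1 captions (zero-filling up to two of the four articulator features during training, then concatenating along the channel dimension) might look roughly like the sketch below. The names `ARTICULATORS`, `mask_articulators`, and `concat_channels` are hypothetical, the feature vectors are toy lists, and drawing the number of masked articulators uniformly from 0 to 2 is an assumption; the captions do not state the sampling distribution.

```python
import random

# The four articulator streams named in the Figure 3 caption.
ARTICULATORS = ("face", "body", "hand_left", "hand_right")


def mask_articulators(features, rng, max_masked=2):
    """Zero-fill between 0 and max_masked articulator features at random.

    The uniform choice of how many to mask is an assumption.
    """
    num_masked = rng.randint(0, max_masked)
    masked_names = set(rng.sample(ARTICULATORS, num_masked))
    masked = {
        name: [0.0] * len(vec) if name in masked_names else list(vec)
        for name, vec in features.items()
    }
    return masked, masked_names


def concat_channels(features):
    """Concatenate articulator features in a fixed order (channel dim)."""
    out = []
    for name in ARTICULATORS:
        out.extend(features[name])
    return out


rng = random.Random(0)
toy_features = {name: [1.0, 2.0, 3.0] for name in ARTICULATORS}

masked_features, dropped = mask_articulators(toy_features, rng)
pose_input = concat_channels(masked_features)

print(sorted(dropped), len(pose_input))
```

Because masked streams are zero-filled and not removed, the concatenated vector keeps a fixed width, so a downstream layer would see the same input shape whether or not masking was applied.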