Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

TL;DR

This work tackles the challenge of aligning spoken subtitles to continuous sign-language videos across languages and domains. It introduces SEA, a modular pipeline that segments signing, embeds signs and subtitles into a shared latent space, and performs a global dynamic-programming alignment, optionally incorporating semantic similarity. SEA achieves state-of-the-art alignment across four datasets and three sign languages, highlighting the value of modular design and language-specific embedding fine-tuning for cross-lingual sign-language processing. The approach enables scalable creation of high-quality parallel data for sign-language translation and related tasks, with public code and models to foster reproducibility and future improvements.

Abstract

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
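
The dynamic-programming step mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `align_subtitles`, the matrix `sim`, and the scoring rule are assumptions made for illustration. The sketch assumes `sim[i][j]` is a centered similarity between subtitle `i` and sign `j` (for example, similarity minus a threshold, so unrelated signs score below zero). Each subtitle receives one non-empty contiguous span of signs, subtitle order is preserved, and signs may be left unassigned. The weighted cost described in the paper, which also uses the original subtitle timing, is not reproduced here.

```python
from typing import List, Optional, Tuple


def align_subtitles(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Return one inclusive (first_sign, last_sign) span per subtitle.

    Illustrative assumption: sim[i][j] is a centered similarity score, and the
    objective is the total score of all signs covered by their subtitle's span.
    """
    m, n = len(sim), len(sim[0])
    if m > n:
        raise ValueError("need at least one sign per subtitle")
    neg_inf = float("-inf")

    # prefix[i][j] = sum of sim[i][:j], so any span sum is an O(1) difference.
    prefix = [[0.0] * (n + 1) for _ in range(m)]
    for i in range(m):
        for j in range(n):
            prefix[i][j + 1] = prefix[i][j] + sim[i][j]

    # best[i][j]: best score placing the first i subtitles within the first j signs.
    best = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    back: List[List[Optional[int]]] = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        best[0][j] = 0.0

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Option 1: leave sign j-1 unassigned.
            best[i][j] = best[i][j - 1]
            # Option 2: subtitle i-1 covers signs k .. j-1.
            for k in range(j):
                if best[i - 1][k] == neg_inf:
                    continue
                score = best[i - 1][k] + prefix[i - 1][j] - prefix[i - 1][k]
                if score > best[i][j]:
                    best[i][j] = score
                    back[i][j] = k

    # Backtrack from the full problem to recover each subtitle's span.
    spans: List[Tuple[int, int]] = []
    i, j = m, n
    while i > 0:
        k = back[i][j]
        if k is None:
            j -= 1  # sign j-1 was skipped
        else:
            spans.append((k, j - 1))
            i, j = i - 1, k
    return spans[::-1]


# Two subtitles, five signs; sign 2 matches neither subtitle and is skipped.
example = [
    [0.9, 0.8, -0.5, -0.5, -0.2],
    [-0.5, -0.4, -0.3, 0.7, 0.6],
]
print(align_subtitles(example))  # [(0, 1), (3, 4)]
```

This version runs in O(m·n²) time for m subtitles and n signs, which is adequate for a demonstration. Sign spans map back to timestamps through the segmentation output, which is how an aligned subtitle would obtain its new start and end times.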

Paper Structure

This paper contains 39 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Aligning subtitles to continuous signing: In broadcast interpreting and other sign language corpora, original subtitles (green) frequently lag or lead the actual signing (yellow) by non-deterministic amounts. Our alignment method, SEA, produces time-corrected subtitles (blue) that better correspond to the signed content. Keyframes are sampled at the midpoint of each span and may not include all annotated signs present in that interval.
  • Figure 2: Our method SEA consists of three modular steps: (1) Segment video frames of continuous signing into individual signs; (2) Embed each sign ($s_1$ to $s_n$; 2a) and subtitle unit ($t_1$ to $t_n$; 2b) into a shared latent space, with their dot product similarities encoded by a similarity matrix; (3) Align subtitles to signing based on the text-sign similarities and the original temporal location of the subtitle units. The similarity matrix is illustrated as a heatmap over time, with darker bars indicating a higher similarity between a sign and a subtitle; similarities of the originally remote signs are zeroed out to ensure locality. The dashed/solid boxes in the heatmap indicate the predicted/manually aligned subtitle locations, respectively.
  • Figure 3: Qualitative results: For each dataset, we sample a 30-second validation clip and show keyframes every 2 seconds. Rows: predicted signs from the segmentation model (yellow), original subtitles (green), SEA-aligned subtitles (blue), and expert-aligned ground truth (purple). In general, segmentation identifies the signing frames of interest; SEA then shifts subtitles toward higher text–sign similarity—for example, in BSL the first subtitle begins at the fingerspelling AMPNIBIS ("amphibians") and ends at the sign for "sliminess".
  • Figure 4: Python-like pseudocode for SEA: segment, embed, and align using DP with a weighted cost function, followed by refining spans and updating subtitles.
  • Figure 5: Aligning subtitles to continuous signing using SEA: From top down, signs are first segmented, then embedded (colored). From bottom up, subtitles are embedded into the same latent space as signs and then aligned according to the direction of similarities.
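
The similarity matrix described in the Figure 2 caption (dot products between subtitle and sign embeddings, with similarities of temporally remote signs zeroed out) might be built roughly as follows. This is a minimal sketch under stated assumptions, not the released code: the function name `similarity_matrix`, the `window_sec` parameter, and the rule of comparing a sign's midpoint against the subtitle's original time span are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(
    text_vecs: List[Sequence[float]],
    subtitle_spans: List[Span],
    sign_vecs: List[Sequence[float]],
    sign_spans: List[Span],
    window_sec: float = 10.0,  # hypothetical locality window, not from the paper
) -> List[List[float]]:
    """Return sim[i][j] for subtitle i and sign j.

    A sign counts as "remote" when its midpoint lies more than window_sec
    outside the subtitle's original time span; remote entries are set to 0.
    """
    sim: List[List[float]] = []
    for t_vec, (t_start, t_end) in zip(text_vecs, subtitle_spans):
        row: List[float] = []
        for s_vec, (s_start, s_end) in zip(sign_vecs, sign_spans):
            midpoint = 0.5 * (s_start + s_end)
            is_near = (t_start - window_sec) <= midpoint <= (t_end + window_sec)
            row.append(dot(t_vec, s_vec) if is_near else 0.0)
        sim.append(row)
    return sim


# One subtitle at 0-3 s; the third sign matches it but is 50 s away, so it is zeroed.
sim = similarity_matrix(
    text_vecs=[[1.0, 0.0]],
    subtitle_spans=[(0.0, 3.0)],
    sign_vecs=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    sign_spans=[(0.0, 1.0), (2.0, 3.0), (50.0, 51.0)],
)
print(sim)  # [[1.0, 0.0, 0.0]]
```

Masking by time keeps each subtitle from being pulled toward a well-matching sign far from its original position, which is the locality property the caption refers to. In practice the embeddings would come from the pretrained sign and text encoders rather than hand-written vectors.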