Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

Zifan Jiang; Youngjoon Jang; Liliane Momeni; Gül Varol; Sarah Ebling; Andrew Zisserman

TL;DR

This work tackles the challenge of aligning spoken-language subtitles to continuous sign-language videos across languages and domains. It introduces SEA, a modular pipeline that segments signing, embeds signs and subtitles into a shared latent space, and performs a global dynamic-programming alignment, optionally incorporating semantic similarity. SEA achieves state-of-the-art alignment across four datasets and three sign languages, highlighting the value of modular design and language-specific embedding fine-tuning for cross-lingual sign-language processing. The approach enables scalable creation of high-quality parallel data for sign-language translation and related tasks, and its code and models are public to support reproducibility and future improvements.

Abstract

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
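
To make the segment, embed, and align flow concrete, the following is a minimal sketch of a monotonic dynamic-programming alignment over a subtitle-to-sign similarity matrix. It is not the SEA implementation: the function names (`cosine_similarity_matrix`, `monotonic_align`, `subtitle_spans`) are hypothetical, the embeddings and sign segments are assumed to come from the two pretrained models, and the sketch makes the simplifying assumptions that every sign segment belongs to exactly one subtitle and every subtitle covers at least one sign.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, sign_emb):
    """Cosine similarity between each subtitle and each sign embedding.

    text_emb: (n_sub, d) array, sign_emb: (n_sign, d) array.
    Both are assumed to live in the same latent space.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    return t @ s.T


def monotonic_align(sim):
    """Assign each sign segment to one subtitle, in temporal order.

    sim[i, j] is the similarity of subtitle i and sign segment j.
    Returns a list `assign` with assign[j] = subtitle index; the indices
    are non-decreasing and every subtitle receives at least one sign
    (an assumption of this sketch, not a claim about SEA).
    """
    n_sub, n_sign = sim.shape
    if n_sub > n_sign:
        raise ValueError("need at least one sign segment per subtitle")

    dp = np.full((n_sub, n_sign), -np.inf)       # best score ending at (i, j)
    advanced = np.zeros((n_sub, n_sign), dtype=bool)  # True if subtitle i starts at j
    dp[0, 0] = sim[0, 0]

    for j in range(1, n_sign):
        for i in range(min(j + 1, n_sub)):
            stay = dp[i, j - 1]                          # sign j continues subtitle i
            move = dp[i - 1, j - 1] if i > 0 else -np.inf  # sign j opens subtitle i
            if move > stay:
                dp[i, j] = move + sim[i, j]
                advanced[i, j] = True
            else:
                dp[i, j] = stay + sim[i, j]

    # Backtrack from the last subtitle at the last sign segment.
    assign = [0] * n_sign
    i = n_sub - 1
    for j in range(n_sign - 1, -1, -1):
        assign[j] = i
        if advanced[i, j]:
            i -= 1
    return assign


def subtitle_spans(assign, segments, n_sub):
    """Turn sign assignments into (start, end) times per subtitle.

    segments[j] = (start, end) of sign segment j, in seconds.
    """
    spans = []
    for i in range(n_sub):
        idx = [j for j, a in enumerate(assign) if a == i]
        spans.append((segments[idx[0]][0], segments[idx[-1]][1]))
    return spans
```

The table has one cell per (subtitle, sign) pair, so the cost grows with the product of the two counts. A fuller treatment would also need to handle signing with no corresponding subtitle, which this sketch deliberately leaves out.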
