YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
Garrett Tanzer, Biao Zhang
TL;DR
This work introduces YouTube-SL-25, a large-scale, open-domain multilingual sign language corpus with more than 3000 hours of video across more than 25 sign languages, assembled from YouTube through automatic candidate retrieval and channel-level triage. It extends a unified T5-based multilingual framework to support multiple source and target languages and an integrated sign language identification task, and reports baselines on benchmarks for four sign languages that show multilingual transfer benefits for both higher- and lower-resource languages. The dataset is more than 3x the size of YouTube-ASL and is the largest parallel sign language dataset to date; it is publicly released as video IDs to enable continued research. The authors acknowledge limitations in coverage and representativeness, and note the need for robust preprocessing and fairness analyses. The work demonstrates the potential of weakly supervised, multilingual pretraining to advance sign language translation and related tasks such as caption alignment and sign language identification, pointing to practical pathways toward more inclusive technologies for Deaf/Hard of Hearing communities.
Abstract
Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With >3000 hours of videos across >25 sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest parallel sign language dataset to date, and c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.
