YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus

Garrett Tanzer, Biao Zhang

TL;DR

This work introduces YouTube-SL-25, a large-scale open-domain multilingual sign-language corpus with over 3000 hours across more than 25 sign languages, assembled from YouTube through automatic candidate retrieval and channel-level triage. It extends a unified T5-based multilingual framework to support multiple source/target languages and an integrated sign-language identification task, reporting baselines for four sign languages that show multilingual transfer benefits for both high- and low-resource languages. The dataset is larger than previous open parallel sign-language resources and is publicly released as video IDs to enable continued research, though the authors acknowledge limitations in coverage, representativeness, and the need for robust preprocessing and fairness analyses. The work demonstrates the potential of weakly supervised, multilingual pretraining to advance sign-language translation and related tasks such as caption alignment and sign-language identification, highlighting practical pathways toward more inclusive Deaf/Hard of Hearing technologies.

Abstract

Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With >3000 hours of videos across >25 sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest parallel sign language dataset to date, and c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.

Paper Structure (13 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: World map (base image: https://commons.wikimedia.org/wiki/File:BlankMap-World-noborders.png) showing the amount of content in YouTube-SL-25 for each sign language; the area of each circle is proportional to the number of hours. The circle in the middle of the Atlantic Ocean represents International Sign. Observe that the dataset is especially lacking in representation for Central & South America, Africa, West & Central Asia, and China & Southeast Asia.
  • Figure 2: Demographic representation of YouTube-SL-25 content (proportion of hours), predicted with proprietary classifiers. These predictions should only be interpreted in aggregate.
  • Figure 3: Unified document-level sign-to-text training, extended for multilinguality; modified from Figure 2 of the FLEURS-ASL paper. New additions are circled in red. For caption alignment, the source and target languages are always provided. For translation, the source and target languages are provided with probability 0.9 and predicted with probability 0.1 (predicted only when text context is not included). We avoid conditioning on the target language (including as text context) without the source language because each source language generally has one target language, which would make the language identification task easier or trivial.
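
The language-conditioning scheme described in the Figure 3 caption can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `sample_language_conditioning`, its parameters, the returned dictionary layout, and the language labels are all hypothetical. It also assumes one reading of the caption, namely that translation examples which include text context always receive the language tags as input.

```python
import random


def sample_language_conditioning(task, source_lang, target_lang,
                                 has_text_context, rng, p_provide=0.9):
    """Decide whether language tags are given as input or must be predicted.

    Hypothetical helper: names and return format are illustrative only.
    """
    if task == "alignment":
        # Caption alignment: languages are always provided.
        provide = True
    elif task == "translation":
        if has_text_context:
            # Assumed reading: prediction is only used without text context.
            provide = True
        else:
            # Provided with probability 0.9, predicted with probability 0.1.
            provide = rng.random() < p_provide
    else:
        raise ValueError(f"unknown task: {task}")

    tags = [source_lang, target_lang]
    # Source and target are provided or predicted together, never target alone.
    if provide:
        return {"provided": tags, "predicted": []}
    return {"provided": [], "predicted": tags}


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10_000
    provided = sum(
        bool(sample_language_conditioning(
            "translation", "ASL", "English", False, rng)["provided"])
        for _ in range(n)
    )
    # Fraction of translation examples where the tags were given as input.
    print(provided / n)
```

Treating the source and target tags as a pair reflects the caption's point that conditioning on the target language alone would nearly reveal the source language.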