2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset
Marta R. Costa-jussà, Bokai Yu, Pierre Andrews, Belen Alastruey, Necati Cihan Camgoz, Joe Chuang, Jean Maillard, Christophe Ropers, Arina Turkantenko, Carleigh Wood
TL;DR
2M-Belebele introduces the first highly multilingual corpus for speech and ASL comprehension. It extends Belebele to 74 spoken languages and one sign language, and aligns the data with FLEURS to create a companion resource, 2M-FLORES. The paper describes a practical dataset construction pipeline (speech and ASL recordings, gloss annotations, and quality controls) and reports baseline experiments with cascaded ASR-plus-LLM systems and multimodal systems in 5-shot and zero-shot settings across 39 shared languages. Speech-based comprehension trails text-based comprehension by about 2-3 percentage points on average, and ASL poses distinct challenges that call for gloss-based data and robust sign-language modeling. The public release of 2M-Belebele and 2M-FLORES enables cross-language, cross-modality evaluation and supports future research on multimodal, multilingual understanding, including ASL translation.
Abstract
We introduce the first highly multilingual speech and American Sign Language (ASL) comprehension dataset by extending BELEBELE. Our dataset covers 74 spoken languages at the intersection of BELEBELE and FLEURS, and one sign language (ASL). We evaluate the 2M-BELEBELE dataset in both 5-shot and zero-shot settings; across languages, speech comprehension accuracy is on average about 2-3% lower than reading comprehension accuracy.
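To make the reported speech-versus-text gap concrete, the following is a minimal sketch of how such an average difference could be computed from per-language multiple-choice accuracies. The function names (`accuracy`, `average_gap`), the language keys, and all numbers are illustrative assumptions; they are not the authors' evaluation code and not results from the paper.

```python
from statistics import mean


def accuracy(predictions, gold):
    """Fraction of multiple-choice answers that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def average_gap(text_acc, speech_acc):
    """Mean (text - speech) accuracy, in percentage points, over shared languages."""
    shared = sorted(set(text_acc) & set(speech_acc))
    if not shared:
        raise ValueError("no shared languages to compare")
    return mean(100.0 * (text_acc[lang] - speech_acc[lang]) for lang in shared)


# Toy placeholder data for illustration only; NOT results from the paper.
gold = [1, 3, 2, 4]
text_acc = {
    "lang_a": accuracy([1, 3, 2, 4], gold),
    "lang_b": accuracy([1, 3, 2, 1], gold),
}
speech_acc = {
    "lang_a": accuracy([1, 3, 2, 1], gold),
    "lang_b": accuracy([1, 3, 1, 1], gold),
}
gap = average_gap(text_acc, speech_acc)
print(f"average text-speech gap: {gap:.1f} percentage points")
```

The gap is averaged only over languages present in both result sets, which mirrors the idea of comparing modalities on a shared language subset. The toy data is far too small to resemble the reported 2-3 point difference.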
