2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset

Marta R. Costa-jussà, Bokai Yu, Pierre Andrews, Belen Alastruey, Necati Cihan Camgoz, Joe Chuang, Jean Maillard, Christophe Ropers, Arina Turkantenko, Carleigh Wood

TL;DR

2M-Belebele introduces the first highly multilingual corpus for speech and ASL comprehension, extending Belebele to 74 spoken languages and one sign language, with a parallelization strategy aligned to FLEURS to create 2M-FLORES. The paper provides a practical dataset construction pipeline (speech and ASL recordings, gloss annotations, and quality controls) and offers baseline experiments using cascaded ASR plus LLM and multimodal systems under 5-shot and zero-shot settings across 39 shared languages. Empirical results show that speech-based comprehension trails text-based comprehension by about 2-3 percentage points on average, and ASL presents distinct challenges requiring gloss-based data and robust sign-language modeling. By releasing 2M-Belebele (and 2M-FLORES) publicly, the work enables cross-language, cross-modality evaluation and paves the way for future multimodal, multilingual understanding research, including deeper ASL translation capabilities.

Abstract

We introduce the first highly multilingual speech and American Sign Language (ASL) comprehension dataset by extending BELEBELE. Our dataset covers 74 spoken languages at the intersection of BELEBELE and FLEURS, and one sign language (ASL). We evaluate the 2M-BELEBELE dataset in both 5-shot and zero-shot settings; across languages, speech comprehension accuracy is on average about 2-3% lower than reading comprehension accuracy.
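To make the reported gap concrete, here is a minimal sketch of how an average text-versus-speech accuracy gap could be computed over the languages both evaluations share. The function name `mean_gap_pp`, the language keys, and all accuracy values are hypothetical placeholders for illustration; they are not results from the paper, and the authors' actual evaluation code may differ.

```python
from statistics import mean

# Hypothetical per-language accuracies in percent (illustrative only,
# NOT numbers reported in the paper).
text_acc = {"lang_a": 70.0, "lang_b": 62.0, "lang_c": 55.0}
speech_acc = {"lang_a": 68.0, "lang_b": 59.0, "lang_c": 52.5}


def mean_gap_pp(text, speech):
    """Average (text - speech) accuracy gap, in percentage points,
    over the languages present in both dictionaries."""
    shared = sorted(set(text) & set(speech))
    if not shared:
        raise ValueError("no shared languages to compare")
    return mean(text[lang] - speech[lang] for lang in shared)


gap = mean_gap_pp(text_acc, speech_acc)
print(f"average gap: {gap:.1f} percentage points")
```

The sketch measures the gap in absolute percentage points (text accuracy minus speech accuracy) and then averages across languages, which is one plausible reading of "2-3% lower on average".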

Paper Structure

This paper contains 21 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: FLEURS vs New Recordings from 2M-Belebele for sentences in passages.
  • Figure 2: Speech and Text Belebele accuracy results in 39 languages. We compare text performance with Llama-3-chat (zero-shot) and speech performance with Whisper+Llama-3-chat (asr+zero-shot).