Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, Christophe Ropers, Matthew Setzler, Skyler Wang, Ife Adebara, Michael Auli, Can Balioglu, Kevin Chan, Chierh Cheng, Joe Chuang, Caley Droof, Mark Duppenthaler, Paul-Ambroise Duquenne, Alexander Erben, Cynthia Gao, Gabriel Mejia Gonzalez, Kehan Lyu, Sagar Miglani, Vineel Pratap, Kaushik Ram Sadagopan, Safiyyah Saleem, Arina Turkatenko, Albert Ventayol-Boada, Zheng-Xin Yong, Yu-An Chung, Jean Maillard, Rashel Moritz, Alexandre Mourachko, Mary Williamson, Shireen Yates

TL;DR

Omnilingual ASR addresses the exclusion of most of the world’s languages from speech technology by delivering an extensible, open-source multilingual ASR framework. It combines a massively multilingual self-supervised speech encoder with an LLM-inspired decoder to enable zero-shot transcription of unseen languages. The models are trained on the large, diverse AllASR corpus and released alongside the Omnilingual ASR Corpus. The work reports strong performance across 1,600+ languages, including hundreds never before served by ASR, with benefits in low-resource settings, zero-shot generalization, and cross-lingual speech-to-text translation, while emphasizing community collaboration and fair compensation. By releasing a family of models from compact to large, together with open datasets and tooling, the project treats language coverage as an extensible, community-driven capability rather than a fixed inventory, with practical implications for accessibility and language preservation.

Abstract

Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most, all while being entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Combining public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date, including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.

Paper Structure

This paper contains 58 sections, 5 equations, 8 figures, 28 tables.

Figures (8)

  • Figure 1: Photographs documenting key moments from the global collection of speech data that produced the Omnilingual ASR Corpus.
  • Figure 2: Commissioned data quality-assurance workflow.
  • Figure 3: Statistics of the AllASR labeled data (hours of speech recordings paired with transcription) used to fine-tune Omnilingual ASR for the ASR task.
  • Figure 4: Statistics of the unlabeled data (hours of speech recordings) used to pre-train Omnilingual ASR.
  • Figure 5: The LLM-ASR model architecture. A wav2vec 2.0 speech encoder and a text embedding matrix embed the speech and text modalities. An autoregressive Transformer decoder emits text tokens, and the system is trained with a next-token prediction objective.
  • ...and 3 more figures
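
The Figure 5 caption describes a decoder that consumes speech and text embeddings and is trained with next-token prediction. The following is a minimal sketch of that objective in plain Python, not the paper's implementation. The encoder and decoder are toy stand-ins (random frames and a causal running mean), and the sizes, the `BOS`/`EOS` ids, the function names, and the choice to score only text positions are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

DIM, VOCAB = 8, 5   # toy sizes (hypothetical)
BOS, EOS = 0, 1     # hypothetical special token ids


def rand_matrix(rows, cols):
    """Small random weights so the toy logits stay close to zero."""
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


text_embedding = rand_matrix(VOCAB, DIM)  # text embedding matrix
output_proj = rand_matrix(VOCAB, DIM)     # hidden state -> vocabulary logits


def toy_speech_encoder(num_frames):
    """Stand-in for a wav2vec 2.0 encoder: one DIM-vector per speech frame."""
    return rand_matrix(num_frames, DIM)


def toy_decoder(inputs):
    """Stand-in for a causal Transformer: output t depends on inputs[0..t] only.

    A running mean is used purely to keep the example dependency-free.
    """
    hidden, running = [], [0.0] * DIM
    for t, x in enumerate(inputs, start=1):
        running = [r + v for r, v in zip(running, x)]
        hidden.append([r / t for r in running])
    return hidden


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def next_token_loss(speech_frames, text_tokens):
    """Mean cross-entropy of predicting each text token from what precedes it."""
    input_tokens = [BOS] + text_tokens   # decoder inputs on the text side
    targets = text_tokens + [EOS]        # same sequence shifted left by one
    # Speech embeddings form the prefix; text embeddings follow.
    inputs = speech_frames + [text_embedding[t] for t in input_tokens]
    hidden = toy_decoder(inputs)
    text_hidden = hidden[len(speech_frames):]  # score text positions only
    total = 0.0
    for h, target in zip(text_hidden, targets):
        logits = [sum(w * x for w, x in zip(row, h)) for row in output_proj]
        total -= log_softmax(logits)[target]
    return total / len(targets)


speech = toy_speech_encoder(num_frames=6)
loss = next_token_loss(speech, [2, 3, 4])
print(f"toy next-token loss: {loss:.3f}")
```

With untrained weights the loss sits near `log(VOCAB)`, the value for a uniform guess over the vocabulary. In a real system the stand-ins would be replaced by the trained encoder and Transformer decoder, with the same input layout and shifted targets.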