Scaling A Simple Approach to Zero-Shot Speech Recognition

Jinming Zhao, Vineel Pratap, Michael Auli

TL;DR

The paper tackles the problem of scaling automatic speech recognition to thousands of languages with minimal labeled data. It introduces MMS Zero-shot, a romanization-based approach that trains a universal acoustic model on data from 1,078 languages and performs zero-shot decoding with a lexicon that maps words in the new language to romanized text, optionally aided by a unigram language model. The method achieves a 46% relative reduction in character error rate (CER) over ASR-2K on 107 unseen languages, and its CER is roughly 2.5× that of in-domain supervised baselines, which supports romanization as a universal text representation for multilingual ASR. The work shows that broad language coverage is practical with modest amounts of text data, and it points to remaining gaps relative to fully supervised in-domain systems, especially in domain-mismatch scenarios.

Abstract

Despite rapid progress in increasing the language coverage of automatic speech recognition, the field is still far from covering all languages with a known writing script. Recent work showed promising results with a zero-shot approach requiring only a small amount of text data; however, accuracy depends heavily on the quality of the phonemizer used, which is often weak for unseen languages. In this paper, we present MMS Zero-shot, a conceptually simpler approach based on romanization and an acoustic model trained on data in 1,078 different languages, or three orders of magnitude more than prior art. MMS Zero-shot reduces the average character error rate by a relative 46% over 100 unseen languages compared to the best previous work. Moreover, the error rate of our approach is only 2.5x higher than that of in-domain supervised baselines, while our approach uses no labeled data for the evaluation languages at all.
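
The two headline metrics in the abstract, character error rate and relative reduction, can be made concrete with a short sketch. This is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the numeric values in the usage note are invented to show the arithmetic and are not results from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (both are error rates)."""
    return (baseline - new) / baseline
```

Under these definitions, a hypothetical baseline CER of 0.50 and a new CER of 0.27 give `relative_reduction(0.50, 0.27)` of about 0.46, which is the sense in which a "relative 46%" reduction is reported.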

Paper Structure (15 sections, 2 figures, 5 tables)

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: MMS Zero-shot. We build a universal acoustic model by fine-tuning a pre-trained wav2vec 2.0 model on romanized transcripts (left). A new language is transcribed by performing beam search decoding with a lexicon mapping words in the new language to romanized text. If available, a language model can be used to improve performance (right).
  • Figure 2: Accuracy on FLEURS dev languages when increasing the amount of CommonCrawl text data to build lexicons and unigram LMs compared to using in-domain text data (FLEURS-topline; 3k utterances).
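
The lexicon and unigram LM mentioned in the two captions above can be sketched as follows, assuming only plain text in the new language is available. This is a rough illustration under stated assumptions, not the authors' pipeline: `romanize` is a hypothetical stand-in that only strips diacritics from Latin-script text (a real romanizer would also handle non-Latin scripts), and the lexicon format (romanized letters separated by spaces) is assumed for illustration.

```python
import math
import unicodedata
from collections import Counter


def romanize(word: str) -> str:
    """Hypothetical stand-in for a real romanizer.

    Lowercases, decomposes accented characters, and keeps only a-z.
    It does NOT handle non-Latin scripts.
    """
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if "a" <= ch <= "z")


def build_lexicon_and_unigram(sentences):
    """Build a word -> romanized-spelling lexicon and unigram log-probabilities.

    `sentences` is an iterable of whitespace-tokenized text lines.
    """
    counts = Counter(word for line in sentences for word in line.split())
    total = sum(counts.values())
    lexicon = {}
    unigram_logprob = {}
    for word, n in counts.items():
        roman = romanize(word)
        if not roman:
            continue  # skip words the stand-in romanizer cannot represent
        lexicon[word] = " ".join(roman)           # e.g. "c a f e"
        unigram_logprob[word] = math.log(n / total)  # relative frequency
    return lexicon, unigram_logprob
```

A decoder would then restrict beam search to words in `lexicon` and, when the unigram scores are used, add each word's log-probability to the acoustic score. How the two scores are weighted is a tuning choice that this sketch leaves out.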