Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Saierdaer Yusuyin, Te Ma, Hao Huang, Wenbo Zhao, Zhijian Ou

TL;DR

The paper addresses data efficiency in multilingual and crosslingual ASR by comparing phonetic, graphemic, and self-supervised pretraining under a unified setup. It introduces Whistle, a weakly phonetic supervision approach that uses IPA labels generated by LanguageNet G2P, evaluated on CV-Lang10 with 10 seen and 2 unseen languages. Phoneme-based pretraining yields superior multilingual data-efficiency and crosslingual data-efficiency, especially in low-data regimes, and demonstrates better training efficiency and resilience to catastrophic forgetting compared with graphemic and self-supervised baselines. The work provides a reproducible pipeline and releases code, models, and data to advance research in data-efficient MCL-ASR.

Abstract

There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has so far been underappreciated for MCL-ASR, although conceptually it is more advantageous for sharing information between languages. This paper explores pretraining with weakly phonetic supervision for data-efficient MCL-ASR, an approach we call Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under this common setup. The experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. We find that when training data is more limited, phoneme supervision can achieve better results than subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research in this direction, we release the code, models and data for the entire Whistle pipeline at https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10.

Paper Structure (22 sections, 2 equations, 4 figures, 9 tables)

This paper contains 22 sections, 2 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Illustration of the pretraining and finetuning procedures with (a) phonetic supervision, (b) subword supervision, and (c) self-supervision.
  • Figure 2: Counts of phoneme and subword units in the CV-Lang10 training set. Note that this is a log-log plot. The distribution of subwords has a sharp peak around a few top subwords and a severe long tail, which shows a more severe data imbalance than the distribution of phonemes.
  • Figure 3: Relative reduction in WER (RRWER) (comparing phoneme pretraining (M1) against subword pretraining (M4) in multilingual speech recognition), as a function of relative increase in phoneme occurrences (RIPO), for the ten languages in CV-Lang10. The figure shows the line of best linear fit: $\text{RRWER} = 0.39 \times \text{RIPO} + 6.6$.
  • Figure 4: Visualization of embeddings by t-SNE. (a) Phoneme embeddings from M1, (b) Subword embeddings from M4. In (a), blue indicates consonants and red indicates vowels.
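
The linear fit quoted in the Figure 3 caption can be evaluated directly. The sketch below is illustrative only. It assumes that RRWER and RIPO are both expressed in percent and that RRWER is computed as the WER difference divided by the baseline WER; neither the units nor the exact definition of RIPO is stated in this summary. The function names and the WER values in the usage lines are hypothetical, not results from the paper.

```python
def relative_wer_reduction(wer_baseline: float, wer_new: float) -> float:
    """Relative WER reduction of wer_new versus wer_baseline, in percent.

    Assumed definition: 100 * (baseline - new) / baseline.
    """
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_new) / wer_baseline


def fitted_rrwer(ripo: float, slope: float = 0.39, intercept: float = 6.6) -> float:
    """Evaluate the best-fit line from the Figure 3 caption.

    slope and intercept are taken from the caption:
    RRWER = 0.39 * RIPO + 6.6 (units assumed to be percent).
    """
    return slope * ripo + intercept


if __name__ == "__main__":
    # Hypothetical WERs for one language: subword model (M4) vs phoneme model (M1).
    print(relative_wer_reduction(wer_baseline=10.0, wer_new=9.0))
    # RRWER predicted by the fitted line at a hypothetical RIPO of 10.
    print(fitted_rrwer(10.0))
```

Under these assumptions, the positive intercept means the fitted line predicts some WER reduction from phoneme pretraining even at zero RIPO, and the positive slope means the predicted reduction grows as RIPO increases.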