Unsupervised ASR via Cross-Lingual Pseudo-Labeling

Tatiana Likhomanenko, Loren Lugosch, Ronan Collobert

TL;DR

This work tackles unsupervised ASR for target languages with no labeled audio by leveraging labeled data from a source language through Cross-Lingual Pseudo-Labeling (CLPL). It bootstraps a target-language acoustic model (AM) by generating pseudo-labels for target audio with a source-language AM and constraining them with a target-language language model (LM), with an optional Phase 2 slimIPL refinement. The approach yields substantial WER/CER gains across diverse language pairs on Common Voice, including English→Swahili (18% WER), and outperforms character-based wav2vec-U 2.0 on LJSpeech by 15% absolute WER while using 800h of labeled German data instead of 60k hours of unlabeled English data. The method is simpler and more flexible than prior unsupervised techniques and also supports cross-alphabet transfer, which makes it a practical option for low-resource languages.

Abstract

Recent work has shown that it is possible to train an $\textit{unsupervised}$ automatic speech recognition (ASR) system using only unpaired audio and text. Existing unsupervised ASR methods assume that no labeled data can be used for training. We argue that even if one does not have any labeled audio for a given language, there is $\textit{always}$ labeled data available for other languages. We show that it is possible to use character-level acoustic models (AMs) from other languages to bootstrap an $\textit{unsupervised}$ AM in a new language. Here, "unsupervised" means no labeled audio is available for the $\textit{target}$ language. Our approach is based on two key ingredients: (i) generating pseudo-labels (PLs) of the $\textit{target}$ language using an AM from some $\textit{other}$ language and (ii) constraining these PLs with a $\textit{target language model}$. Our approach is effective on Common Voice: e.g., transferring an English AM to Swahili achieves 18% WER. It also outperforms character-based wav2vec-U 2.0 by 15% absolute WER on LJSpeech with 800h of labeled German data instead of 60k hours of unlabeled English data.
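
The two ingredients named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the candidate list stands in for hypotheses produced by a source-language AM, the unigram word LM stands in for a real target-language LM, and all names (`train_unigram_lm`, `pick_pseudo_label`, `alpha`, `beta`) and the Swahili toy sentences are assumptions made for illustration. The combined score uses the common shallow-fusion form (AM log-probability plus `alpha` times LM log-probability plus `beta` times word count), which is assumed here rather than taken from the text above.

```python
import math
from collections import Counter

NEG_INF = float("-inf")

def train_unigram_lm(sentences):
    """Toy word-level unigram LM built from unpaired target-language text."""
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    return {word: math.log(c / total) for word, c in counts.items()}

def lm_logprob(lm, text):
    """Log-probability of `text`; -inf if any word is unknown to the LM."""
    words = text.split()
    if any(word not in lm for word in words):
        return NEG_INF
    return sum(lm[word] for word in words)

def fused_score(text, am_logprob, lm, alpha=1.0, beta=0.0):
    """Assumed shallow-fusion score: AM + alpha * LM + beta * word count."""
    lm_lp = lm_logprob(lm, text)
    if lm_lp == NEG_INF:
        return NEG_INF  # hypotheses with unknown words are rejected
    return am_logprob + alpha * lm_lp + beta * len(text.split())

def pick_pseudo_label(candidates, lm, alpha=1.0, beta=0.0):
    """candidates: (text, am_logprob) pairs from a hypothetical source AM.

    Returns the best LM-constrained hypothesis, or None if all are rejected.
    """
    scored = [(fused_score(t, lp, lm, alpha, beta), t) for t, lp in candidates]
    best_score, best_text = max(scored)
    return best_text if best_score > NEG_INF else None

# Toy unpaired target-language text (illustrative only).
target_text = ["habari ya asubuhi", "habari ya jioni", "asante sana"]
lm = train_unigram_lm(target_text)

# Hypothetical source-AM hypotheses for one target-language utterance.
candidates = [
    ("habary ya asubui", -2.0),   # best AM score, but not valid target words
    ("habari ya asubuhi", -3.5),
    ("asante sana", -9.0),
]
pseudo_label = pick_pseudo_label(candidates, lm)
print(pseudo_label)
```

In this sketch the hypothesis with the best acoustic score is discarded because it contains words unknown to the target LM, so the pseudo-label comes from the next-best hypothesis. In a full pipeline, such pseudo-labels would then serve as training targets for the target-language AM.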

Paper Structure

This paper contains 32 sections, 1 equation, 10 figures, and 14 tables.

Figures (10)

  • Figure 1: Motivation: reasonable zero-shot ASR for Swahili is possible by decoding with an English acoustic model (AM) constrained by a Swahili language model (LM), suggesting that training on the resulting pseudo-labels could improve the acoustic model.
  • Figure 2: Comparison of standard monolingual pseudo-labeling and unsupervised ASR via cross-lingual pseudo-labeling, where labeled data are available for a source language and no labeled audio is available for the target language.
  • Figure 3: Zero-shot evaluation and cross-lingual pseudo-labeling word error rate (WER, top) and character error rate (CER, bottom) on Common Voice v12.0 for different source languages ($X\rightarrow$) with labeled data and target languages ($\rightarrow X$) with unpaired audio and text data: (i) zero-shot evaluation with a source acoustic model ("Source AM") on a target language; (ii) zero-shot evaluation of a source AM coupled with a target language model ("Source AM $|$ Target LM") via LM beam-search decoding (beam size is $100$, $\alpha=1, \beta=0$, unknown words are not accepted); (iii) cross-lingual pseudo-labeling with greedy decoding ("Cross-Lingual PL") and LM beam-search decoding ("Cross-Lingual PL $|$ Target LM"). Beam size is set to 1k and $\alpha,\beta$ are tuned via random search. Supervised models trained on the same target data and decoded with LM beam search are given as reference baselines.
  • Figure 4: Cross-lingual PL dependence on the amount of labeled data in the source language (left) and unlabeled audio in the target language (right) for $en\to sw$: the left plot uses all target-language hours (50h), while the right plot uses all source-language hours (1550h).
  • Figure 5: Zero-shot CER (left) and WER (right) with greedy (top) and LM beam-search decoding (bottom) on Common Voice validation sets, for models trained on a source language $X\rightarrow$ and transferred to a target language $\rightarrow X$. Beam size is set to 100 and $\alpha=1, \beta=0$. We found that German LM decoding is worse than greedy decoding because unknown words are not accepted in the decoding process.
  • ...and 5 more figures
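
The figures above report word error rate (WER) and character error rate (CER). As a reminder of how these metrics are usually computed, here is a minimal sketch based on Levenshtein edit distance. It is not the paper's evaluation code: the function names are illustrative, the reference is assumed to be non-empty, and counting spaces as characters in CER is an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length.

    Spaces are counted as characters here (an assumption of this sketch).
    """
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Toy example: one misspelled word out of three.
reference = "habari ya asubuhi"
hypothesis = "habary ya asubuhi"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

A single wrong character makes one of the three words wrong, so WER penalizes this hypothesis much more heavily than CER. This helps explain why a character-level AM transferred across languages can have a moderate CER while its WER stays high until a target LM constrains the output.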