Unsupervised pretraining transfers well across languages

Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux

TL;DR

This study demonstrates that unsupervised pretraining via Contrastive Predictive Coding can transfer phoneme representations across languages, including low-resource ones. By stabilizing training and substituting a Transformer-based predictor, the authors show that frozen, pretrained CPC features can significantly improve cross-language phoneme classification with minimal supervision, and that larger unlabelled datasets reduce the gap to supervised pretraining, even surpassing it in some cases. The work employs LibriSpeech for pretraining, transfers to Common Voice languages, and uses ZeroSpeech ABX to quantify cross-language discriminability, underscoring the practical potential of unsupervised pretraining for resource-scarce languages. Overall, unsupervised CPC pretraining emerges as a viable, scalable approach for cross-lingual phoneme representations when labeled data is scarce.

Abstract

Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting. This assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, performing on par with, or even outperforming, supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.

Paper Structure

This paper contains 23 sections, 2 equations, 1 figure, 8 tables.

Figures (1)

  • Figure S1: CPC versus Bottleneck features. The CPC features shown here were trained on the 60k-hour Libri-Light dataset.