Unsupervised pretraining transfers well across languages
Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, Emmanuel Dupoux
TL;DR
This study shows that phoneme representations learned by unsupervised pretraining with Contrastive Predictive Coding (CPC) transfer across languages, including low-resource ones. After stabilizing training and switching to a Transformer-based predictor, the authors find that frozen, pretrained CPC features improve cross-language phoneme classification with minimal supervision, and that larger unlabelled datasets narrow the gap to supervised pretraining, in some cases surpassing it. The work pretrains on LibriSpeech, transfers to Common Voice languages, and uses ZeroSpeech ABX to quantify cross-language discriminability. Overall, unsupervised CPC pretraining emerges as a viable, scalable approach to cross-lingual phoneme representations when labelled data is scarce.
Abstract
Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been extensively investigated in the supervised setting, which assumes the existence of a parallel corpus of speech and orthographic transcriptions. Recently, contrastive predictive coding (CPC) algorithms have been proposed to pretrain ASR systems with unlabelled data. In this work, we investigate whether unsupervised pretraining transfers well across languages. We show that a slight modification of the CPC pretraining extracts features that transfer well to other languages, performing on par with, or even outperforming, supervised pretraining. This shows the potential of unsupervised methods for languages with few linguistic resources.
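For readers unfamiliar with CPC, the contrastive objective it relies on can be sketched as below. This is a generic, minimal illustration of an InfoNCE-style loss and not the authors' implementation. The function name `info_nce_loss`, the dot-product scoring, and the choice to use the other rows of the batch as negatives are all assumptions made for the example.

```python
import math


def info_nce_loss(predictions, targets):
    """Average contrastive (InfoNCE-style) loss over a batch.

    predictions: list of N vectors, the predictor's guesses for future encodings.
    targets:     list of N vectors, the encoder's actual future encodings.
    Row i of `targets` is the positive for row i of `predictions`;
    every other row acts as a negative (an assumption of this sketch).
    """
    n = len(predictions)
    total = 0.0
    for i in range(n):
        # Dot-product score between prediction i and every candidate target.
        scores = [
            sum(p * t for p, t in zip(predictions[i], targets[j]))
            for j in range(n)
        ]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of picking the true future encoding.
        total += log_norm - scores[i]
    return total / n
```

The loss is low when each prediction scores its own target far above the others. When all scores are equal it reduces to log N, the value for a uniform guess among N candidates.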
