Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models
Asad Ullah, Alessandro Ragano, Andrew Hines
TL;DR
This paper tackles the data scarcity challenge in self-supervised speech representation learning (SSRL) for low-resource languages by comparing cross-lingual pre-training with artificial data augmentation during pre-training. Using an APC-based model trained on a 25-hour English proxy, the study examines pitch-only, noise-only, and combined pitch+noise augmentations versus adding other languages (African-accented English, Mandarin) and evaluates on phoneme recognition. The results show that combined pitch and noise augmentation delivers the strongest downstream gains, outperforming accent and cross-lingual transfer, and that roughly 17× augmentation is required to match performance achieved with target-language data. The findings suggest augmentation as a practical strategy to bolster SSRL in resource-constrained settings and highlight potential synergy with cross-lingual pre-training, with future work including TTS-based augmentation and broader open-source model applications.
Abstract
Self-supervised representation learning (SSRL) has demonstrated better performance than supervised models on tasks including phoneme recognition. Training SSRL models poses a challenge for low-resource languages, where sufficient pre-training data may not be available. A common approach is cross-lingual pre-training. Instead, we propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource condition, and we evaluate them on phoneme recognition. Our comparisons found that a combined synthetic augmentation strategy (noise and pitch) outperformed accent and language knowledge transfer. Furthermore, we examined the scaling factor of augmented data needed to achieve performance equivalent to that of a model pre-trained with target-domain speech. Our findings suggest that, for resource-constrained languages, combined augmentations can be a more viable option than other augmentations.
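As an illustration of what a combined pitch and noise perturbation of a waveform can look like, the following is a minimal NumPy sketch. It is not the authors' pipeline: the function names, the white Gaussian noise, the 10 dB SNR, the 2-semitone shift, and the synthetic sine-wave input are all assumptions chosen for the example. The pitch shift is done by plain resampling, which also changes the duration; a production setup would more likely use a duration-preserving pitch shifter and recorded noise.

```python
import numpy as np


def add_noise_at_snr(wave, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def shift_pitch_by_resampling(wave, semitones):
    """Shift pitch by reading the waveform faster or slower (linear interpolation).

    Simplification (assumption): the duration changes by the same factor,
    so this is closer to speed perturbation than to a pure pitch shift.
    """
    factor = 2.0 ** (semitones / 12.0)
    n_out = int(round(len(wave) / factor))
    src_positions = np.arange(n_out) * factor
    return np.interp(src_positions, np.arange(len(wave)), wave)


def augment(wave, semitones, snr_db, rng):
    """Combined augmentation: pitch perturbation followed by additive noise."""
    shifted = shift_pitch_by_resampling(wave, semitones)
    return add_noise_at_snr(shifted, snr_db, rng)


# Hypothetical input: one second of a 220 Hz tone at 16 kHz standing in for speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)

rng = np.random.default_rng(0)
augmented = augment(clean, semitones=2.0, snr_db=10.0, rng=rng)
```

Applying `augment` several times to each utterance with different random shifts and SNRs is one way to multiply the amount of pre-training audio, which is the kind of scaling the abstract refers to.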
