Low-Resource Self-Supervised Learning with SSL-Enhanced TTS
Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed
TL;DR
This work addresses the dependence of self-supervised speech models on large unlabeled corpora by introducing a low-resource pipeline that uses synthesized speech to augment pre-training. It builds a high-quality TTS system from limited data: discrete speech units are derived from a HuBERT model trained on ~100 hours of speech, and a text-to-unit model, duration and pitch predictors, and a VAE-based unit-to-speech module are then trained to generate a large synthetic corpus. The approach reduces the real speech data needed for pre-training by up to ~90% while reaching a final WER of ~15.8% on LibriSpeech dev-other, close to topline performance. The method offers a practical path for deploying SSL-based speech systems in data-scarce settings, and potentially for low-resource languages, with benefits in data efficiency and in privacy, since pre-training relies far less on raw recorded speech.
Abstract
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.
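The TL;DR pairs a text-to-unit model with a duration predictor. The summary does not say how the unit sequences are represented, so the following is only a minimal sketch of one common convention in unit-based TTS, assumed here rather than taken from the paper: frame-level units are collapsed into runs, the text-to-unit model targets the collapsed sequence, and the duration predictor targets the run lengths. The function names `collapse_units` and `expand_units` are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def collapse_units(frame_units: List[int]) -> Tuple[List[int], List[int]]:
    """Run-length encode frame-level unit IDs.

    Returns (units, durations), where durations[i] is the number of
    consecutive frames assigned to units[i]. Assumed convention, not
    confirmed by the paper summary.
    """
    units: List[int] = []
    durations: List[int] = []
    for unit, run in groupby(frame_units):
        units.append(unit)
        durations.append(sum(1 for _ in run))
    return units, durations


def expand_units(units: List[int], durations: List[int]) -> List[int]:
    """Invert collapse_units: repeat each unit for its predicted duration."""
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    frames: List[int] = []
    for unit, duration in zip(units, durations):
        frames.extend([unit] * duration)
    return frames


# Illustrative unit IDs only; real IDs would come from a unit extractor.
example_frames = [5, 5, 5, 12, 12, 7, 5, 5]
example_units, example_durations = collapse_units(example_frames)
```

Under this convention, expanding the collapsed units with their durations recovers the frame-level sequence, which is what a unit-to-speech module would consume at synthesis time.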
