From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
TL;DR
This work tackles data scarcity in multilingual ASR with Speech Back-Translation, a scalable pipeline that converts large text corpora into synthetic speech using zero-shot TTS models trained on as little as tens of hours of real transcribed speech. It introduces an intelligibility-based metric, Norm_I, to gauge synthetic speech quality and to establish thresholds for when synthetic data helps ASR training. By generating more than 500,000 hours of synthetic speech across ten languages and continuing pre-training of Whisper-large-v3, the approach reduces transcription error rates by over 30% on average, with pronounced benefits for low-resource languages. The results show strong scalability, improved generalization to out-of-domain data, and practical ways to exploit limited in-domain data through TTS adaptation and synthetic pre-training. The work lowers the data burden for deploying high-quality multilingual ASR in resource-constrained languages and offers a framework for future work on synthetic data quality assessment and TTS/ASR domain adaptation.
Abstract
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
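To make the idea of an intelligibility check concrete, the following is a minimal sketch of one common way to screen synthetic speech: transcribe each synthetic utterance with an existing ASR model and keep it only if the word error rate (WER) against the source text is low. This is an illustration under stated assumptions, not the paper's metric. The abstract does not define its assessment framework, so the function names `word_error_rate` and `filter_synthetic`, the `transcribe` callable, and the `max_wer` threshold of 0.1 are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Audio = TypeVar("Audio")  # placeholder for whatever audio representation is used


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def filter_synthetic(
    pairs: Iterable[Tuple[str, Audio]],
    transcribe: Callable[[Audio], str],
    max_wer: float = 0.1,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[str, Audio]]:
    """Keep (text, audio) pairs whose ASR transcript stays close to the text."""
    kept = []
    for text, audio in pairs:
        if word_error_rate(text, transcribe(audio)) <= max_wer:
            kept.append((text, audio))
    return kept
```

In practice `transcribe` would wrap a real ASR model and the threshold would be tuned per language; both choices are left open here because the abstract does not specify them.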
