From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng

TL;DR

This work tackles data scarcity in multilingual ASR by introducing Speech Back-Translation, a scalable pipeline that uses zero-shot TTS models, fine-tuned on as little as tens of hours of real transcribed speech, to convert large text corpora into massive quantities of synthetic speech. It introduces an intelligibility-based metric, Norm_I, to gauge synthetic speech quality and to establish thresholds at which synthetic data benefits ASR training. By generating more than 500,000 hours of synthetic speech across ten languages and continuing pre-training of Whisper-large-v3, the approach reduces transcription error rates by over 30% on average, with pronounced benefits for low-resource languages. The results demonstrate strong scalability, improved generalization to out-of-domain data, and practical pathways for leveraging limited in-domain data through TTS adaptation and synthetic pre-training. The work lowers the amount of real transcribed speech needed to deploy high-quality multilingual ASR in resource-constrained languages and offers a framework for future improvements in synthetic data quality assessment and TTS/ASR domain adaptation.

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
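
To make the headline "transcription error reduction" concrete, the sketch below computes a word error rate (WER) by word-level edit distance, along with the relative reduction between a baseline and an improved system. It is a minimal illustration, not the paper's evaluation code: the function names `word_error_rate` and `relative_reduction` and the sample sentences and rates are assumptions made up for this example, and the paper's actual text normalization and per-language metric choices are not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the error rate dropped relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_reduction(0.20, 0.14):.0%}")
```

Under this definition, a drop from a hypothetical 20% WER to 14% WER is a 30% relative reduction, which is the sense in which an average reduction is usually reported.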

Paper Structure

This paper contains 48 sections, 2 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Pipeline of Speech Back-Translation. The main objective is to augment limited training data ($\leq$100 hours) for low-resource languages by synthesizing extensive amounts of speech ($>$10,000 hours). Starting from a multilingual TTS model pre-trained with high-resource languages, we fine-tune it on a small set of seed data, then generate synthetic speech by conditioning the fine-tuned model on a large textual corpus and diverse audio prompts.
  • Figure 2: XTTS inference speed measured on a single NVIDIA V100-32GB GPU. "DS" refers to DeepSpeed-Inference while "Batch" refers to batch inference. For batch inference, we set the batch size to 16.
  • Figure 3: Comparison of dataset sizes across seven languages (log-scale y-axis). Languages are categorized by resource availability in the Whisper dataset: (a) high-resource, (b) mid- and low-resource groups.
  • Figure 4: Whisper's performance improves consistently with larger models and more training data. We train five sizes of Whisper models with up to 160,000 hours of data and conduct evaluation on Common Voice 16. We report averaged WER across seven languages.
  • Figure 5: Impact of training data quantity and epochs on Vietnamese TTS quality. The purple dashed line shows the WER of natural speech from Fleurs.
  • ...and 2 more figures