Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov
TL;DR
The paper tackles the challenge of expanding TTS language coverage beyond resource-rich languages by introducing a joint speech-text representation framework that leverages found data (speech-text pairs, untranscribed speech, and unspoken text). It uses a four-component architecture (S2F, F2T, T2F, F2S) with a fixed joint feature space $Z$, foundation-model pretraining, and curriculum training to enable zero- and few-shot learning across 100+ languages. Key results show intelligible zero-shot synthesis in over 30 unseen languages (CER difference $<10\%$ relative to ground truth). With as little as 15 minutes of transcribed data, the CER difference drops to $\le 1\%$ and MOS becomes comparable to ground truth in several languages. The approach demonstrates scalable multilingual TTS using publicly available data, reducing reliance on costly transcriptions and enabling broader language coverage.
Abstract
Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground truth, and achieve naturalness scores that match the ground truth in several languages.
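The intelligibility figures above are stated as a CER (character error rate) difference to ground truth. The sketch below shows one plausible reading of that metric: transcribe both the synthesized and the ground-truth recording with an ASR system, score each transcript against the reference text, and subtract. The function names and the plain Levenshtein-based CER are assumptions made for illustration; the text above does not specify the paper's exact evaluation pipeline, text normalization, or ASR model.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cer_difference(reference: str, asr_of_synth: str, asr_of_truth: str) -> float:
    """Assumed definition: CER of synthesized speech minus CER of ground-truth speech.

    Both hypotheses are ASR transcripts (hypothetical inputs here) scored
    against the same reference text.
    """
    return cer(reference, asr_of_synth) - cer(reference, asr_of_truth)


if __name__ == "__main__":
    ref = "hello world"
    diff = cer_difference(ref, asr_of_synth="hallo world", asr_of_truth="hello world")
    print(f"CER difference: {diff:.2%}")
```

Under this reading, a zero-shot language would count as intelligible when the corpus-level value stays below 0.10, and the 15-minute setting corresponds to a value of 0.01 or less. Subtracting the ground-truth CER removes errors that come from the ASR system itself rather than from the synthesizer.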
