Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

Takaaki Saeki; Gary Wang; Nobuyuki Morioka; Isaac Elias; Kyle Kastner; Fadi Biadsy; Andrew Rosenberg; Bhuvana Ramabhadran; Heiga Zen; Françoise Beaufays; Hadar Shemtov

TL;DR

The paper tackles the challenge of expanding TTS language coverage beyond resource-rich languages by introducing a joint speech-text representation framework that leverages three kinds of found data: speech-text pairs, untranscribed speech, and unspoken text. It uses a four-component architecture (S2F, F2T, T2F, F2S) with a fixed joint feature space $Z$, foundation-model pretraining, and curriculum training to enable zero- and few-shot learning across 100+ languages. Key results show intelligible zero-shot synthesis in over 30 unseen languages (CER difference $<10\%$ to ground truth) and, with as little as 15 minutes of transcribed data, a CER difference of $\le 1\%$ and MOS comparable to ground truth in several languages. The approach demonstrates scalable multilingual TTS using publicly available data, reducing reliance on costly transcriptions and enabling broader language coverage.

Abstract

Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.
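
The intelligibility figures above are stated as a CER difference to ground truth. As a rough illustration of how such a number could be computed, the sketch below assumes that an ASR system transcribes both the synthesized speech and the ground-truth recordings, and that the reported value is the corpus-level CER of the former minus that of the latter, in percentage points. The function names, the corpus-level aggregation, and the absence of any text normalization are assumptions made for illustration; the evaluation protocol is not specified in this section.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level character error rate in percent (assumed aggregation)."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_chars


def cer_difference(refs: Sequence[str],
                   asr_on_synthesized: Sequence[str],
                   asr_on_ground_truth: Sequence[str]) -> float:
    """CER of synthesized speech minus CER of ground-truth recordings.

    Hypothetical helper: both hypothesis lists are ASR transcripts scored
    against the same reference text.
    """
    return cer(refs, asr_on_synthesized) - cer(refs, asr_on_ground_truth)
```

Under this reading, a language would clear the zero-shot intelligibility bar when `cer_difference(...)` is below 10, and the 15-minute few-shot setting would correspond to a value of 1 or less.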

Paper Structure (13 sections, 1 equation, 3 figures, 3 tables)

Introduction
Related Work
Proposed framework
Training Objective
Speech-text encoder pretraining
Joint training with unsupervised speech-text data
Curriculum training procedures
Experimental Setting
Results
Main results for TTS evaluations
Analysis for TTS language expansion
Ablation studies
Conclusions

Figures (3)

Figure 1: Supervised learning with paired speech-text data.
Figure 2: Self-supervised speech-text pretraining and unsupervised speech-text injection using untranscribed speech and unspoken text.
Figure 3: CER difference to ground-truth data and number of languages.
