Meta Learning Text-to-Speech Synthesis in over 7000 Languages
Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu
TL;DR
This work tackles scalable text-to-speech synthesis across 7212 languages, including many with no data, by pretraining on 462 languages (≈18,000 hours) and conditioning a language-agnostic model on learned language embeddings. A Language Embedding Space Structure (LESS) loss aligns distances between embeddings with multiple language-similarity metrics, while a meta-learning module estimates the embedding of an unseen language by averaging the embeddings of its $k$ nearest supervised languages (with $5 \le k \le 25$). The system achieves competitive objective and subjective performance against a strong baseline while substantially expanding language coverage, and it is released openly to empower resource-limited communities. The results demonstrate the feasibility of zero-shot TTS across a broad typological spectrum and highlight practical considerations for safe, ethical deployment and community engagement.
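The neighbor-averaging step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `estimate_unseen_embedding`, the precomputed `distances` vector, and the use of an unweighted mean are assumptions made for illustration. The summary states only that nearby supervised languages are averaged, with a neighbor count between 5 and 25.

```python
import numpy as np


def estimate_unseen_embedding(distances, supervised_embeddings, k=5):
    """Approximate an unseen language's embedding (hypothetical helper).

    distances: shape (n_supervised,), an assumed precomputed distance from
        the unseen language to each supervised language, e.g. derived from
        language-similarity metrics.
    supervised_embeddings: shape (n_supervised, dim), learned embeddings of
        the supervised languages.
    k: number of nearest supervised languages to average. The summary gives
        a range of 5 to 25; this sketch does not enforce that range.
    """
    distances = np.asarray(distances, dtype=float)
    supervised_embeddings = np.asarray(supervised_embeddings, dtype=float)

    if distances.ndim != 1 or supervised_embeddings.ndim != 2:
        raise ValueError("expected a 1-D distance vector and a 2-D embedding matrix")
    if distances.shape[0] != supervised_embeddings.shape[0]:
        raise ValueError("one distance is required per supervised language")
    if not 1 <= k <= distances.shape[0]:
        raise ValueError("k must be between 1 and the number of supervised languages")

    # Indices of the k supervised languages with the smallest distance.
    nearest = np.argsort(distances)[:k]

    # Unweighted mean of their embeddings (an assumption; a weighted
    # average would be an equally plausible reading of the summary).
    return supervised_embeddings[nearest].mean(axis=0)
```

Calling `estimate_unseen_embedding(distances, embeddings, k=5)` returns a vector of the same dimensionality as the supervised embeddings, which could then be used to condition the synthesis model in place of a learned embedding.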
Abstract
In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
