Meta Learning Text-to-Speech Synthesis in over 7000 Languages

Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

TL;DR

This work tackles scalable text-to-speech synthesis across 7212 languages, including many with no data, by pretraining on 462 languages (≈18,000 hours) and conditioning a language-agnostic model on learned language embeddings. A Language Embedding Space Structure (LESS) loss aligns distances between embeddings with multiple language-similarity metrics, while a meta-learning module estimates the embedding of an unseen language by averaging the embeddings of its $k$ nearest supervised languages (with $5 \le k \le 25$). The system achieves competitive objective and subjective performance against a strong baseline while substantially expanding language coverage, and it is released openly to empower resource-limited communities. The results demonstrate the feasibility of zero-shot TTS across a broad typological spectrum and highlight practical considerations for safe, ethical deployment and community engagement.
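
The neighbor-averaging step described above can be illustrated with a small sketch. This is not the released implementation: the function name `approximate_embedding`, the toy two-dimensional embeddings, and the distance values are all hypothetical, and the distances are assumed to come from some language-similarity function that is already available.

```python
def approximate_embedding(embeddings, distance_to_target, k):
    """Estimate an unseen language's embedding as the mean of the
    embeddings of its k nearest supervised languages.

    embeddings: dict mapping language code -> list of floats (hypothetical data)
    distance_to_target: dict mapping language code -> distance to the unseen language
    """
    if not 1 <= k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of supervised languages")
    # Rank supervised languages by their distance to the unseen language.
    nearest = sorted(embeddings, key=lambda lang: distance_to_target[lang])[:k]
    dim = len(embeddings[nearest[0]])
    # Average the selected embeddings dimension by dimension.
    return [sum(embeddings[lang][i] for lang in nearest) / k for i in range(dim)]


# Toy example with made-up values (real embeddings are higher-dimensional).
toy_embeddings = {"deu": [1.0, 0.0], "nld": [0.8, 0.2], "vie": [0.0, 1.0]}
toy_distances = {"deu": 0.2, "nld": 0.1, "vie": 0.9}
approx = approximate_embedding(toy_embeddings, toy_distances, k=2)
print(approx)
```

With `k=2` the two closest toy languages (`nld` and `deu`) are averaged and `vie` is ignored. The summary states only that $k$ lies between 5 and 25 in the actual system; the small value here is chosen purely to keep the toy example readable.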

Abstract

In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: A world map showing the coverage of supervised (462) and zero-shot (6750) languages in our work.
  • Figure 2: Reconstruction error for approximating the 462 language embeddings from our supervised set using their $k$ nearest neighbors, which are determined either at random, via distance metrics (inverse ASP, tree distance, map distance), their average (avg), or our meta-learned distance function.
  • Figure 3: Boxplots for the listening test results.
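
The kind of comparison summarized in Figure 2 can be sketched as a leave-one-out experiment: each supervised embedding is hidden, approximated from $k$ other languages, and compared with its true value. The sketch below is only an illustration under stated assumptions. The error measure (mean Euclidean distance), the helper names, the toy embeddings, and the toy distance function are all hypothetical and are not taken from the paper.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def reconstruction_error(embeddings, pick_neighbours, k):
    """Leave-one-out error: hide each language, rebuild its embedding from
    k other languages, and average the Euclidean distance to the truth."""
    total = 0.0
    for lang, true_vec in embeddings.items():
        candidates = [other for other in embeddings if other != lang]
        neighbours = pick_neighbours(lang, candidates, k)
        approx = mean_vector([embeddings[n] for n in neighbours])
        total += math.dist(approx, true_vec)
    return total / len(embeddings)


def nearest_by(distance):
    """Neighbour selection using a pairwise distance function."""
    def pick(lang, candidates, k):
        return sorted(candidates, key=lambda other: distance(lang, other))[:k]
    return pick


def random_pick(rng):
    """Baseline: neighbours chosen uniformly at random."""
    def pick(lang, candidates, k):
        return rng.sample(candidates, k)
    return pick


# Toy data: two made-up clusters standing in for two language families.
toy = {
    "a1": [0.0, 0.0], "a2": [0.1, 0.0], "a3": [0.0, 0.1],
    "b1": [5.0, 5.0], "b2": [5.1, 5.0], "b3": [5.0, 5.1],
}


def toy_distance(x, y):
    # Hypothetical metric: languages sharing a prefix letter count as close.
    return 0.0 if x[0] == y[0] else 1.0


informed_error = reconstruction_error(toy, nearest_by(toy_distance), k=2)
random_error = reconstruction_error(toy, random_pick(random.Random(0)), k=2)
print(informed_error, random_error)
```

In this toy setting an informative distance keeps every neighbor inside the correct cluster, so its reconstruction error cannot exceed that of random selection. The figure reports the analogous comparison for the 462 supervised languages across several distance measures.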