ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

TL;DR

ZMM-TTS addresses the challenge of zero-shot multilingual multispeaker TTS with limited data by leveraging discrete self-supervised speech representations from a large multilingual model. It introduces a two-stage architecture (txt2vec and vec2wav) that can operate with either a Mel-based vocoder or direct waveform generation, and it integrates pre-trained phoneme encoders to improve cross-lingual transfer. Through extensive experiments across six high-resource languages and two low-resource scenarios, ZMM-TTS demonstrates superior naturalness and speaker similarity over Mel-based baselines, with strong zero-shot and few-shot adaptation, particularly when using phoneme-based inputs. The work suggests significant potential for scalable, data-efficient multilingual TTS and points to future improvements in language adaptability and representation strategies for broader language coverage.

Abstract

Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voice, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.

Paper Structure (57 sections, 8 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Overview of ZMM-TTS. The modules txt2vec and vec2wav are trained independently. Language ID is utilized for high-resource languages and few-shot adaptation, but it is not used for direct inference without fine-tuning.
  • Figure 2: Architecture of our proposed txt2vec model.
  • Figure 3: vec2wav with independent Mel-based vocoder.
  • Figure 4: vec2wav without independent Mel-based vocoder.
  • Figure 5: Visualization of speaker embeddings extracted from ECAPA-TDNN and H/ASP.
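
The two-stage flow summarized in Figure 1 (text to discrete speech representations, then discrete representations to waveform, with an optional language ID) can be illustrated with a toy sketch. This is not the authors' implementation: the function bodies are placeholder arithmetic standing in for trained networks, and the codebook size, frame counts, hop length, and the point at which the speaker embedding enters are all assumptions made only for illustration.

```python
from typing import List, Optional, Sequence

# Hypothetical sizes chosen for this sketch; not values from the paper.
CODEBOOK_SIZE = 256
FRAMES_PER_PHONEME = 2
HOP_LENGTH = 4


def txt2vec(phonemes: Sequence[str], language_id: Optional[int] = None) -> List[int]:
    """Toy stage 1: map phonemes to discrete code indices, one per frame.

    language_id is optional, mirroring the Figure 1 caption: it is used for
    high-resource languages and few-shot adaptation, and left out (None) for
    direct inference without fine-tuning.
    """
    offset = 0 if language_id is None else language_id + 1
    codes: List[int] = []
    for phoneme in phonemes:
        base = sum(ord(ch) for ch in phoneme)  # placeholder for a learned encoder
        code = (base + offset) % CODEBOOK_SIZE
        codes.extend([code] * FRAMES_PER_PHONEME)
    return codes


def vec2wav(codes: Sequence[int], speaker_emb: Sequence[float]) -> List[float]:
    """Toy stage 2: expand each code into HOP_LENGTH waveform samples.

    The speaker embedding only scales the output here; a real model would
    condition a neural decoder or vocoder on it.
    """
    gain = sum(speaker_emb) / len(speaker_emb)
    wav: List[float] = []
    for code in codes:
        wav.extend([gain * code / CODEBOOK_SIZE] * HOP_LENGTH)
    return wav


def synthesize(phonemes: Sequence[str], speaker_emb: Sequence[float],
               language_id: Optional[int] = None) -> List[float]:
    """Chain the two stages; in this sketch they share only the code sequence."""
    return vec2wav(txt2vec(phonemes, language_id), speaker_emb)
```

Because the stages communicate only through the discrete code sequence, each one can be developed or swapped on its own, which is consistent with the caption's note that txt2vec and vec2wav are trained independently.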