Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu

TL;DR

This work tackles zero-shot multispeaker text-to-speech by decoupling speaker modeling from synthesis. It combines three components: a speaker encoder pre-trained on untranscribed speech to produce d-vector embeddings, a Tacotron 2-based synthesizer conditioned on those embeddings, and a WaveNet vocoder for waveform generation. The results show that the voice characteristics of unseen speakers can be transferred to the TTS model, that training the encoder on a large, diverse speaker set matters for robust generalization, and that the learned speaker embedding space is meaningful. The synthesized speech is natural-sounding with plausible speaker similarity, but the authors acknowledge limitations in accent transfer and a need for larger embedding capacity or active adaptation to capture finer speaker nuances. Overall, the study presents a scalable strategy for zero-shot multispeaker TTS, with practical implications for accessibility and multilingual applications and clear paths for improvement.

Abstract

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
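The conditioning step in component (2) can be pictured with a small sketch. This is an illustration only, not the authors' implementation: it assumes the speaker embedding is L2-normalized and concatenated onto every time step of the text encoder output, and the names `l2_normalize` and `condition_on_speaker` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector cannot be normalized."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same (normalized) speaker embedding to every time step.

    encoder_outputs: list of T frames, each a list of D_text floats.
    speaker_embedding: list of D_spk floats.
    Returns T frames, each of length D_text + D_spk.
    Assumption: concatenation is used as the conditioning mechanism.
    """
    emb = l2_normalize(speaker_embedding)
    return [list(frame) + emb for frame in encoder_outputs]


# Toy input: 2 text-encoder frames of width 3 and a 2-dimensional embedding.
encoder_outputs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
speaker_embedding = [3.0, 4.0]
conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
```

Because the embedding is fixed for the whole utterance, every frame receives the same speaker vector; only the text-dependent part changes over time.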

Paper Structure

This paper contains 21 sections, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Model overview. Each of the three components is trained independently.
  • Figure 2: Example synthesis of a sentence in different voices using the proposed system. Mel spectrograms are visualized for reference utterances used to generate speaker embeddings (left), and the corresponding synthesizer outputs (right). The text-to-spectrogram alignment is shown in red. Three speakers held out of the training sets are used: one male (top) and two female (center and bottom).
  • Figure 3: Visualization of speaker embeddings extracted from LibriSpeech utterances. Each color corresponds to a different speaker. Real and synthetic utterances appear nearby when they are from the same speaker; however, real and synthetic utterances consistently form distinct clusters.
  • Figure 4: Per-speaker naturalness MOS of ground truth and synthesized speech on unseen VCTK speakers.
  • Figure 5: Example synthesis of a sentence conditioned on several random speaker embeddings sampled from the unit hypersphere. All samples contain consistent phonetic content, but there is clear variation in fundamental frequency and speaking rate. Audio files corresponding to these utterances are included in the demo page (https://google.github.io/tacotron/publications/speaker_adaptation).
  • ...and 1 more figure
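
The Figure 5 caption describes random speaker embeddings sampled from the unit hypersphere. A standard way to draw such points is to sample an isotropic Gaussian vector and normalize it to unit length. The sketch below shows that technique; whether the authors sampled this way is an assumption, as are the function name `sample_unit_hypersphere`, the dimension 256, and the seed.

```python
import math
import random


def sample_unit_hypersphere(dim, rng):
    """Draw one point uniformly from the unit sphere in `dim` dimensions.

    Normalizing an isotropic Gaussian sample gives a uniform direction.
    """
    while True:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0.0:  # guard against the (vanishingly unlikely) zero vector
            return [x / norm for x in vec]


# Hypothetical settings: 3 embeddings of dimension 256, fixed seed.
rng = random.Random(0)
embeddings = [sample_unit_hypersphere(256, rng) for _ in range(3)]
```

Each sampled vector could then be passed to the synthesizer in place of an embedding computed from reference speech.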