MultiSpeech: Multi-Speaker Text to Speech with Transformer

Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, Tie-Yan Liu

TL;DR

The paper addresses the difficulty of learning text-to-speech alignment in multi-speaker Transformer TTS, where noisy data and diverse speakers make alignment harder. It introduces three techniques to improve alignment and speech quality: a diagonal constraint on encoder-decoder attention, layer normalization on the phoneme embeddings in the encoder, and a bottleneck in the decoder pre-net. Empirical results on VCTK and LibriTTS show MOS and alignment gains over a naive Transformer TTS baseline, with MultiSpeech approaching ground-truth quality. Used as a teacher, MultiSpeech also yields a multi-speaker FastSpeech model with almost no quality degradation and much faster inference.

Abstract

Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown advantages in training and inference efficiency over RNN-based models (e.g., Tacotron~\cite{shen2018natural}) due to their parallel computation in training and/or inference. However, parallel computation makes it harder to learn the alignment between text and speech in Transformer, a difficulty that is further magnified in the multi-speaker scenario with noisy data and diverse speakers and that hinders the applicability of Transformer to multi-speaker TTS. In this paper, we develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment: 1) a diagonal constraint on the weight matrix of encoder-decoder attention in both training and inference; 2) layer normalization on the phoneme embedding in the encoder to better preserve position information; 3) a bottleneck in the decoder pre-net to prevent copying between consecutive speech frames. Experiments on the VCTK and LibriTTS multi-speaker datasets demonstrate the effectiveness of MultiSpeech: 1) it synthesizes more robust and better-quality multi-speaker voice than naive Transformer-based TTS; 2) with a MultiSpeech model as the teacher, we obtain a strong multi-speaker FastSpeech model with almost zero quality degradation while enjoying extremely fast inference speed.
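
As a rough illustration of technique 2, the sketch below normalizes each phoneme embedding before adding a positional encoding, so that a large embedding scale cannot swamp the position signal. It is a minimal sketch under stated assumptions, not the authors' implementation: the sinusoidal positional encoding, the omission of a learnable gain and bias, and the names `layer_norm`, `sinusoidal_position`, and `encode_phonemes` are all hypothetical choices made for this example.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one embedding vector to zero mean and (near) unit variance.

    Assumption: the learnable gain and bias of a full layer norm are omitted.
    """
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def sinusoidal_position(pos, dim):
    """Standard sinusoidal positional encoding for one position.

    Assumption: the excerpt does not say which positional encoding is used.
    """
    encoding = []
    for j in range(dim):
        pair_index = j - (j % 2)  # even and odd slots share one frequency
        angle = pos / (10000 ** (pair_index / dim))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding


def encode_phonemes(phoneme_embeddings):
    """Layer-normalize each phoneme embedding, then add its position signal."""
    dim = len(phoneme_embeddings[0])
    encoded = []
    for pos, embedding in enumerate(phoneme_embeddings):
        normed = layer_norm(embedding)
        position = sinusoidal_position(pos, dim)
        encoded.append([n + p for n, p in zip(normed, position)])
    return encoded


# Two embeddings that differ only in scale end up on the same footing.
small = layer_norm([1.0, 2.0, 3.0, 4.0])
large = layer_norm([100.0, 200.0, 300.0, 400.0])
encoded = encode_phonemes([[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]])
```

Because both inputs are normalized to the same range, the positional term contributes comparably to each, which is one way to read the abstract's claim that the normalization helps "preserve position information".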

Paper Structure

This paper contains 12 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The model structure of our proposed MultiSpeech. The green blocks are the newly added modules for multi-speaker TTS based on Transformer.
  • Figure 2: (a) An illustration of the diagonal constraint in attention: the upper panel has a small diagonal constraint loss and the lower panel has a large one. (b) The model structure of the pre-net bottleneck in the decoder.
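
The idea behind Figure 2(a) can be sketched in a few lines: measure how much attention mass falls inside a band around the diagonal of the encoder-decoder attention matrix, and penalize mass outside it. The paper's own equations are not reproduced in this excerpt, so the formulation below (a fixed-width band, a rate averaged over decoder frames, and a loss equal to the negative rate) is an assumption, and the names `diagonal_attention_rate`, `diagonal_constraint_loss`, and `bandwidth` are hypothetical.

```python
def diagonal_attention_rate(attn, bandwidth):
    """Fraction of attention mass lying within a band around the diagonal.

    attn[s][i] is the weight that decoder frame s places on phoneme i;
    each row is assumed to sum to 1.
    """
    num_frames = len(attn)
    num_phonemes = len(attn[0])
    slope = num_phonemes / num_frames  # phonemes advanced per decoder frame
    in_band = 0.0
    for s, row in enumerate(attn):
        center = slope * s  # expected phoneme index for frame s
        for i, weight in enumerate(row):
            if abs(i - center) <= bandwidth:
                in_band += weight
    return in_band / num_frames


def diagonal_constraint_loss(attn, bandwidth):
    """Assumed loss: the negative in-band rate, so diagonal attention scores lower."""
    return -diagonal_attention_rate(attn, bandwidth)


# Attention that follows the diagonal (compare the upper panel of Figure 2a).
diagonal = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Attention stuck on the first phoneme (a failed alignment).
stuck = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]

loss_diagonal = diagonal_constraint_loss(diagonal, bandwidth=1)
loss_stuck = diagonal_constraint_loss(stuck, bandwidth=1)
```

Under these assumptions the well-aligned matrix receives a lower loss than the stuck one, matching the small-loss and large-loss cases the caption describes.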