Neural Speech Synthesis with Transformer Network

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou

TL;DR

This work addresses inefficiencies and limited long-range dependency modeling in end-to-end neural TTS by adapting the Transformer to Tacotron2, including phoneme inputs, scaled positional encoding, and prenets. The proposed Transformer-TTS demonstrates about a 4.25× training speedup and MOS scores competitive with Tacotron2, approaching human quality when paired with a WaveNet vocoder. Ablation studies reveal that trainable positional scales and careful pre-net centering improve performance, while deeper/denser configurations yield prosodic and spectrogram gains at higher computational cost. The results suggest Transformer-based TTS can deliver natural-sounding speech with substantial training efficiency, and point to future work on non-autoregressive architectures to further speed inference and mitigate autoregressive biases.

Abstract

Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by the self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts into the final audio. Experiments are conducted to test the efficiency and performance of our new network. For efficiency, our Transformer TTS network trains about 4.25 times faster than Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs. 4.44 in MOS).
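
The abstract's point that self-attention connects any two time steps directly can be illustrated with a minimal sketch of single-head scaled dot-product self-attention. This is a simplified illustration in plain Python, not the authors' implementation: it assumes the queries, keys, and values are all the raw input vectors (no learned projections), and it omits multiple heads and decoder masking. The function names `softmax` and `self_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention over x, a list of T vectors of size d.

    Assumption for illustration: queries, keys, and values are the
    inputs themselves. Returns (outputs, weights), where weights[i][j]
    is how strongly position i attends to position j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # One dot product per position: every pair (i, j) is scored directly.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        # Output is the attention-weighted average of all input vectors.
        out = [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights
```

Each row of `weights` is computed independently of the others, so all positions can be processed in parallel, and every position receives a nonzero weight from every other position regardless of how far apart they are in time.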

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: System architecture of Tacotron2.
  • Figure 2: System architecture of Transformer.
  • Figure 3: System architecture of our model.
  • Figure 4: Mel spectrogram comparison. Our model (6-layer) reconstructs details better, as marked by the red rectangles, while Tacotron2 and our 3-layer model blur the texture, especially in the high-frequency region. Best viewed in color.
  • Figure 5: PE scale of encoder and decoder.
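
The scaled positional encoding mentioned in the TL;DR, whose scale Figure 5 refers to, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses the standard sinusoidal encoding from the Transformer and multiplies it by a single scalar `alpha` before adding it to the pre-net output. In the model `alpha` would be a trainable parameter; here it is a plain float, and the names `positional_encoding` and `add_scaled_pe` are hypothetical.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency.
        angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def add_scaled_pe(frames, alpha):
    """Return frames[t] + alpha * PE(t) for each time step t.

    frames: list of T vectors of size d_model (e.g. pre-net outputs).
    alpha:  scalar weight on the positional encoding (trainable in the
            model; a fixed float in this sketch).
    """
    d_model = len(frames[0])
    return [
        [v + alpha * p for v, p in zip(frame, positional_encoding(t, d_model))]
        for t, frame in enumerate(frames)
    ]
```

With `alpha = 1.0` this reduces to the fixed positional encoding of the original Transformer. Letting the scale vary allows the encoder and decoder to weight position information differently relative to their respective pre-net outputs.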