
Non-Autoregressive Neural Text-to-Speech

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

TL;DR

This paper introduces ParaNet, a fully convolutional non-autoregressive text-to-spectrogram model that synthesizes 46.7 times faster than the lightweight Deep Voice 3 while maintaining reasonably good speech quality. It uses layer-wise attention refinement and attention distillation from a pretrained autoregressive teacher to stabilize text–speech alignment, enabling a fully parallel TTS pipeline when paired with parallel vocoders. The authors also propose WaveVAE, a VAE-based approach to train a parallel vocoder from scratch, and compare flow-based and autoregressive-inspired vocoders. Experimental results show substantial speedups over autoregressive baselines and competitive perceptual quality, with ablations highlighting the importance of attention distillation, positional encoding, and decoder depth. Overall, the work advances fast, fully parallel TTS with robust alignment and practical deployment potential.

Abstract

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings a 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build a parallel text-to-speech system and test various parallel neural vocoders; the system can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.

Paper Structure

This paper contains 22 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) Autoregressive seq2seq model. The dashed line depicts the autoregressive decoding of mel spectrogram at inference. (b) Non-autoregressive ParaNet model, which distills the attention from a pretrained autoregressive model.
  • Figure 2: (a) Architecture of ParaNet. Its encoder provides the key and value as the textual representation. The first attention block in the decoder takes the positional encoding as its query and is followed by non-causal convolution blocks and attention blocks. (b) The convolution block, which appears in both the encoder and the decoder. It consists of a 1-D convolution with a gated linear unit (GLU) and a residual connection.
  • Figure 3: Our ParaNet iteratively refines the attention alignment in a layer-by-layer way. One can see the 1st layer attention is mostly dominated by the positional encoding prior. It becomes more and more confident about the alignment in the subsequent layers.
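
The convolution block described in the Figure 2(b) caption (a 1-D convolution, a gated linear unit, and a residual connection) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the nested-list tensor layout, the zero "same" padding, and the convention that the convolution emits 2C channels split into a value half and a gate half are all assumptions made for illustration. Details the caption does not mention (dropout, output scaling, weight normalization) are left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_same(x, weight, bias):
    """Non-causal 1-D convolution with zero 'same' padding (assumed layout).

    x:      list of T frames, each a list of C_in floats
    weight: nested list shaped [C_out][C_in][K], with K odd
    bias:   list of C_out floats
    Returns T frames of C_out floats.
    """
    num_frames = len(x)
    c_in = len(x[0])
    kernel = len(weight[0][0])
    pad = kernel // 2
    out = []
    for t in range(num_frames):
        frame = []
        for o in range(len(weight)):
            acc = bias[o]
            for i in range(c_in):
                for j in range(kernel):
                    s = t + j - pad  # reaches both past and future frames
                    if 0 <= s < num_frames:
                        acc += weight[o][i][j] * x[s][i]
            frame.append(acc)
        out.append(frame)
    return out


def glu_conv_block(x, weight, bias):
    """Conv -> GLU -> residual, for input frames with C channels.

    The convolution must produce 2*C channels: the first C are the values,
    the last C are the gates (an assumed, commonly used GLU convention).
    """
    channels = len(x[0])
    hidden = conv1d_same(x, weight, bias)
    out = []
    for frame_in, frame_h in zip(x, hidden):
        values, gates = frame_h[:channels], frame_h[channels:]
        out.append([
            xi + v * sigmoid(g)  # residual connection around the gated output
            for xi, v, g in zip(frame_in, values, gates)
        ])
    return out


if __name__ == "__main__":
    # One channel, kernel size 3. The value filter copies the *next* frame,
    # and the gate filter is all zeros, so every gate equals sigmoid(0) = 0.5.
    weight = [[[0.0, 0.0, 1.0]], [[0.0, 0.0, 0.0]]]
    bias = [0.0, 0.0]
    x = [[1.0], [2.0], [3.0]]
    print(glu_conv_block(x, weight, bias))  # [[2.0], [3.5], [3.0]]
```

In the toy example each output frame is the input frame plus half of the following frame, which makes the non-causal behaviour visible: a causal block could not read frame t+1 when producing frame t. Because no output depends on a previously generated output, all frames can be computed independently, which is what permits parallel synthesis.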