Non-Autoregressive Neural Text-to-Speech
Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao
TL;DR
This paper introduces ParaNet, a fully convolutional non-autoregressive text-to-spectrogram model that significantly speeds up synthesis while maintaining quality. It leverages layer-wise attention refinement and distillation from a teacher to stabilize text–speech alignment, enabling a fully parallel TTS pipeline when paired with parallel vocoders. The authors also propose WaveVAE, a VAE-based approach to train a parallel vocoder from scratch, and compare flow-based and autoregressive-inspired vocoders. Experimental results show substantial speedups over autoregressive baselines and competitive perceptual quality, with ablations highlighting the importance of attention distillation, positional encoding, and decoder depth. Overall, the work advances rapid, end-to-end parallel TTS with robust alignment and practical deployment potential.
Abstract
In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings a 46.7-fold speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build a parallel text-to-speech system and test various parallel neural vocoders; the resulting system can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet, as in previous work.
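To make the idea of layer-by-layer attention refinement more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each decoder layer attends over the encoder output with scaled dot-product attention, that the first layer's queries are sinusoidal positional encodings, and that each layer adds its attention context to the queries used by the next layer. The function names (`positional_encoding`, `attend`, `refine_attention`) and all dimensions are hypothetical, and the convolution blocks of the actual model are omitted. The point it illustrates is that every output frame is processed at once, with no dependence on previously generated frames.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(length, dim):
    """Sinusoidal positional encoding of shape [length][dim]."""
    table = []
    for t in range(length):
        row = []
        for i in range(dim):
            angle = t / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def attend(queries, keys, values):
    """Scaled dot-product attention for all query positions at once."""
    dim = len(keys[0])
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(dim)
                 for k_row in keys])
        for q_row in queries
    ]
    context = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, values))
         for j in range(len(values[0]))]
        for w_row in weights
    ]
    return weights, context


def refine_attention(encoder_out, n_frames, n_layers):
    """Hypothetical layer-wise refinement: each layer re-attends using
    queries updated with the previous layer's context (an assumption)."""
    dim = len(encoder_out[0])
    # Assumption: the first layer's queries carry only positional information.
    queries = positional_encoding(n_frames, dim)
    alignments = []
    for _ in range(n_layers):
        weights, context = attend(queries, encoder_out, encoder_out)
        # Residual update: the next layer's queries now include text content.
        queries = [[q + c for q, c in zip(q_row, c_row)]
                   for q_row, c_row in zip(queries, context)]
        alignments.append(weights)
    return queries, alignments


# Toy usage: 5 encoded text positions, 12 spectrogram frames, 3 layers.
random.seed(0)
enc = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
hidden, aligns = refine_attention(enc, n_frames=12, n_layers=3)
print(len(aligns), len(aligns[0]), len(aligns[0][0]))
```

Each entry of `aligns` is one layer's text-to-frame alignment matrix. The loop runs over layers only, never over time steps, which is what allows synthesis in a single feed-forward pass.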
