Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram
Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
TL;DR
Autoregressive vocoders deliver high quality but generate slowly; distillation-based methods speed up generation but require complex two-stage teacher-student training. The authors propose Parallel WaveGAN, a distillation-free, GAN-based non-autoregressive WaveNet trained with a joint objective of multi-resolution STFT loss and adversarial loss, which simplifies training and enables faster-than-real-time generation with only 1.44 M parameters. On 24 kHz speech, it generates audio 28.68 times faster than real time on a single GPU and achieves a MOS of 4.16 within a Transformer-based TTS system, competitive with the best distillation-based Parallel WaveNet system (ClariNet). The work shows that a simple GAN-based approach can deliver high perceptual quality together with fast generation and simpler training, which makes it a practical candidate for real-time TTS.
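To make the speed figure concrete, the short sketch below converts the reported 28.68x speedup at 24 kHz into samples generated per second and compute time per second of audio. The helper name `generation_stats` is hypothetical and used only for illustration; the two input numbers are the ones quoted above.

```python
def generation_stats(sample_rate_hz, speedup_over_real_time):
    """Convert a real-time speedup factor into throughput and real-time factor.

    Hypothetical helper for illustration only.
    """
    # Audio samples produced per second of wall-clock time.
    samples_per_second = sample_rate_hz * speedup_over_real_time
    # Wall-clock seconds needed to produce one second of audio.
    real_time_factor = 1.0 / speedup_over_real_time
    return samples_per_second, real_time_factor


samples_per_s, rtf = generation_stats(24_000, 28.68)
print(f"{samples_per_s:,.0f} samples/s")
print(f"{rtf * 1000:.1f} ms of compute per second of audio")
```

Under these numbers, one second of 24 kHz audio takes roughly 35 ms to generate.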
Abstract
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate a 24 kHz speech waveform 28.68 times faster than real time in a single-GPU environment. Perceptual listening test results verify that our proposed method achieves a mean opinion score of 4.16 within a Transformer-based text-to-speech framework, which is comparable to the best distillation-based Parallel WaveNet system.
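The multi-resolution spectrogram term can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes each resolution contributes a spectral-convergence term plus a log-magnitude L1 term, and that the per-resolution losses are averaged. The function names, the `eps` floor, and the three (FFT size, hop, window length) settings in `RESOLUTIONS` are assumptions chosen for illustration; the abstract does not state them.

```python
import numpy as np

# Assumed analysis settings: (fft_size, hop, win_length) in samples.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]


def stft_magnitude(x, fft_size, hop, win_length, eps=1e-7):
    """Magnitude spectrogram of a 1-D signal with a Hann window (no padding)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    # rfft zero-pads each frame from win_length up to fft_size.
    spec = np.fft.rfft(frames, n=fft_size, axis=-1)
    # Floor the magnitude so the logarithm below stays finite.
    return np.maximum(np.abs(spec), eps)


def stft_loss(x, x_hat, fft_size, hop, win_length):
    """Spectral convergence plus log-magnitude L1 at one resolution."""
    mag = stft_magnitude(x, fft_size, hop, win_length)
    mag_hat = stft_magnitude(x_hat, fft_size, hop, win_length)
    # Frobenius-norm ratio emphasizes errors in high-energy regions.
    spectral_convergence = np.linalg.norm(mag - mag_hat) / np.linalg.norm(mag)
    # Mean absolute log difference also weights low-energy regions.
    log_magnitude = np.mean(np.abs(np.log(mag) - np.log(mag_hat)))
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(x, x_hat, resolutions=RESOLUTIONS):
    """Average the single-resolution losses over all analysis settings."""
    return float(np.mean([stft_loss(x, x_hat, *r) for r in resolutions]))


# Usage: compare a 220 Hz tone at 24 kHz with a noisy copy of itself.
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 220 * np.arange(4800) / 24_000)
noisy = target + 0.1 * rng.standard_normal(target.shape)
print(multi_resolution_stft_loss(target, target))
print(multi_resolution_stft_loss(target, noisy))
```

Averaging over several window and FFT sizes keeps the generator from fitting one fixed time-frequency trade-off. In training, this term would be combined with the adversarial loss from the discriminator; how the two are weighted is not specified in the abstract.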
