Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech
Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie
TL;DR
MB-MelGAN extends MelGAN to multi-band generation and replaces the feature matching loss with a multi-resolution STFT (MR-STFT) loss, yielding a compact model with substantial speedups. It achieves MOS of 4.34 on waveform generation and 4.22 on TTS, matching or surpassing the quality of the original MelGAN. Experimental results show a sevenfold speedup and a reduction in computational complexity from 5.85 to 0.95 GFLOPS without sacrificing perceptual quality, making real-time TTS on CPU more feasible. A shared sub-band generator, pseudo-QMF filter banks, and the MR-STFT loss together underpin these practical improvements for end-to-end TTS systems.
Abstract
In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which has proven beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation, which will be open-sourced shortly, achieves a real-time factor of 0.03 on CPU without hardware-specific optimization.
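To make the multi-resolution STFT loss mentioned above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common formulation that sums a spectral convergence term and a log-magnitude term at each resolution; the abstract does not spell out these terms. The function names and the (FFT size, hop, window length) triples in `RESOLUTIONS` are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(x, fft_size, hop, win_len):
    """Magnitude STFT with a Hann window; each frame is zero-padded to fft_size."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, n=fft_size, axis=1)
    # Floor the magnitude so the logarithm below is always defined.
    return np.maximum(np.abs(spec), 1e-7)

def stft_loss(real, fake, fft_size, hop, win_len):
    """Single-resolution loss: spectral convergence + log-magnitude distance."""
    mag_real = stft_magnitude(real, fft_size, hop, win_len)
    mag_fake = stft_magnitude(fake, fft_size, hop, win_len)
    # Spectral convergence: Frobenius-norm error relative to the real spectrum.
    sc = np.linalg.norm(mag_real - mag_fake) / np.linalg.norm(mag_real)
    # Log-magnitude term: mean absolute difference of log spectra.
    log_mag = np.mean(np.abs(np.log(mag_real) - np.log(mag_fake)))
    return sc + log_mag

# Assumed (fft_size, hop, win_len) settings, chosen for illustration only.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]

def multi_resolution_stft_loss(real, fake, resolutions=RESOLUTIONS):
    """Average the single-resolution loss over several STFT configurations."""
    return sum(stft_loss(real, fake, *r) for r in resolutions) / len(resolutions)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.standard_normal(4800)
    fake = real + 0.1 * rng.standard_normal(4800)
    print(multi_resolution_stft_loss(real, fake))
```

Averaging over several window and hop settings means that no single time-frequency trade-off dominates the comparison between generated and real speech. In training, the same computation would be expressed with differentiable tensor operations rather than NumPy.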
