SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis
Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gael Richard
TL;DR
SpecDiff-GAN addresses two weaknesses of neural vocoders, the training instability of GANs and the slow sampling of diffusion models, by integrating a forward diffusion process with spectrally-shaped noise into a HiFi-GAN backbone. The method introduces adaptive diffusion and a multi-resolution discriminator (MRD) setup to stabilize training and improve high-frequency detail, and uses a spectrally-aware noise distribution to make the discriminator's task harder. The approach improves perceptual metrics on speech (e.g., PESQ, STOI) and is competitive in music synthesis across multiple datasets, with a smaller model and efficient inference relative to strong baselines. The result is a versatile, diffusion-informed augmentation of GAN-based audio synthesis that carries over to both speech and instrumental music generation, with potential for broader universal audio synthesis.
Abstract
Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by means of a forward diffusion process, which consists of injecting noise from a Gaussian distribution into both real and fake samples before inputting them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in audio quality and efficiency.
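The forward diffusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (make_alpha_bar, shape_noise, diffuse), the linear noise schedule, and the fixed, time-invariant spectral envelope are all assumptions made for illustration, since the abstract does not specify how the noise is shaped or scheduled. The sketch only shows the general idea of adding white or spectrally-shaped Gaussian noise to both real and generated waveforms before they reach a discriminator.

```python
import numpy as np

def make_alpha_bar(num_steps=50, beta_min=1e-4, beta_max=0.02):
    # Assumed linear beta schedule; returns the cumulative product of (1 - beta).
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)

def shape_noise(white, envelope):
    # Filter white noise in the frequency domain so that its magnitude
    # spectrum follows `envelope` (length n // 2 + 1), then rescale the
    # result to unit standard deviation.
    spectrum = np.fft.rfft(white)
    shaped = np.fft.irfft(spectrum * envelope, n=white.shape[-1])
    return shaped / (shaped.std() + 1e-8)

def diffuse(x0, t, alpha_bar, rng, envelope=None):
    # Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    # With `envelope` set, eps is spectrally shaped instead of white.
    eps = rng.standard_normal(x0.shape)
    if envelope is not None:
        eps = shape_noise(eps, envelope)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Toy usage: a sine wave stands in for real audio, and a noisy copy
# stands in for the generator output.
rng = np.random.default_rng(0)
n = 1024
real = np.sin(2 * np.pi * 220 * np.arange(n) / 22050)
fake = real + 0.1 * rng.standard_normal(n)

alpha_bar = make_alpha_bar()
envelope = np.linspace(1.0, 0.1, n // 2 + 1)  # hypothetical low-pass-like envelope
t = 10

# Both signals are diffused at the same step before being passed to a discriminator.
real_t = diffuse(real, t, alpha_bar, rng, envelope)
fake_t = diffuse(fake, t, alpha_bar, rng, envelope)
```

In this sketch the envelope is a fixed ramp chosen for readability. A shaping scheme conditioned on the input mel spectrogram would replace it with a signal-dependent, time-varying filter.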
