GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model

Haocheng Liu, Teysir Baoueb, Mathieu Fontaine, Jonathan Le Roux, Gael Richard

TL;DR

This work addresses conditioning accuracy and cross-speaker generalization in diffusion-based speech generation by introducing GLA-Grad, an inference-time extension that applies Griffin-Lim phase retrieval at each reverse-diffusion step, using a magnitude target derived from the conditioning mel spectrogram. The approach requires no retraining and can be applied to pre-trained waveform generators, improving convergence and robustness for unseen speakers. Empirical results on LJ Speech and VCTK show that GLA-Grad remains competitive with WaveGrad and SpecGrad in closed-set conditions and yields clear advantages in cross-speaker and domain-adaptation scenarios, with reduced performance variability. The method balances quality and speed by adding a lightweight phase-correction step that complements existing diffusion-based speech models.

Abstract

Diffusion models are attracting growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditions on the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which introduces a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
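
The idea of interleaving a phase-recovery projection with each reverse-diffusion step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the STFT settings (`N_FFT`, `HOP`), the helper names, and the `denoise_step` callable (a stand-in for a pre-trained model's reverse step, such as WaveGrad's) are all assumptions made for illustration, and the target magnitude is taken as given rather than derived from a mel spectrogram.

```python
import numpy as np

# Assumed STFT settings for illustration only (not taken from the paper).
N_FFT = 64
HOP = 16
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def stft(x):
    """Complex STFT of a 1-D signal; one windowed frame per row."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack(
        [x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, length):
    """Least-squares inverse STFT via window-weighted overlap-add."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    # Samples not covered by any window stay at zero.
    return out / np.maximum(norm, 1e-8)


def griffin_lim_projection(x, target_mag, n_iter=1):
    """Griffin-Lim iterations: keep the current phase, impose the target magnitude."""
    for _ in range(n_iter):
        phase = np.exp(1j * np.angle(stft(x)))
        x = istft(target_mag * phase, len(x))
    return x


def spectral_error(x, target_mag):
    """Distance between the STFT magnitude of x and the target magnitude."""
    return float(np.linalg.norm(np.abs(stft(x)) - target_mag))


def gla_corrected_sampling(denoise_step, target_mag, length, n_steps, rng):
    """Reverse loop with a phase-recovery correction after every step.

    `denoise_step(x, t)` is a hypothetical placeholder for the reverse step
    of an already-trained diffusion model; no model is implemented here.
    """
    x = rng.standard_normal(length)
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t)
        x = griffin_lim_projection(x, target_mag)
    return x
```

In this sketch the correction is a pure post-processing step on the current sample, which is why such a scheme can in principle wrap an already-trained generator without retraining it. For a toy check, `target_mag` can be taken as `np.abs(stft(clean))` for a known signal `clean`, and `spectral_error` compared before and after `griffin_lim_projection`.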

Paper Structure (11 sections, 12 equations, 1 figure, 4 tables)