ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic

TL;DR

ResGrad tackles the slow inference of diffusion-based TTS by learning a residual diffusion model that refines the mel-spectrogram output of an existing non-iterative TTS model. Because it predicts the difference between the ground-truth mel-spectrogram and the base model’s output rather than the full mel-spectrogram, ResGrad has a simpler generation target, can use a lighter model, and can be applied in a plug-and-play fashion without retraining the base model. Across LJSpeech, LibriTTS, and VCTK, it delivers higher MOS than other DDPM speed-up methods at the same real-time factor and is more than 10× faster at similar quality. These results underscore ResGrad’s potential to enable real-time diffusion-based TTS with minimal architectural changes to existing systems and motivate extending the residual-diffusion idea to other domains.

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.

Paper Structure (31 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Illustration of ResGrad. ResGrad first predicts the residual between the mel-spectrogram estimated by an existing TTS model and the ground-truth mel-spectrogram, and then adds the residual to the estimated mel-spectrogram to get the refined mel-spectrogram.
  • Figure 2: The comparison between the residual calculated with ground-truth pitch, and the residual calculated without ground-truth pitch (i.e., pitch predicted by model).
  • Figure 3: The left column shows a ground-truth residual sample calculated with both $dur_{GT}$ and $pitch_{GT}$ (top) and the corresponding predicted residual sample (bottom), while the right column shows the corresponding ground-truth residual sample calculated with $dur_{GT}$ and $pitch_{Pred}$ (top) and the corresponding predicted residual sample (bottom).
  • Figure 4: The comparison between the mel-spectrogram generated by FastSpeech 2, the mel-spectrogram refined by ResGrad, and the ground-truth mel-spectrogram. FastSpeech 2 suffers from the over-smoothing problem, while ResGrad generates a mel-spectrogram with more detailed frequency information, which is more similar to the ground-truth mel-spectrogram.
  • Figure 5: The decoding process of ResGrad in $4$ sampling steps.
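
The residual idea in Figures 1 and 5 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It assumes mel-spectrograms stored as NumPy arrays, a textbook DDPM ancestral sampler with a linear noise schedule (the paper's sampler and schedule may differ), and a hypothetical noise-prediction callable `eps_model(x, t, mel_base)` standing in for the trained ResGrad network. All function names and schedule values are illustrative assumptions.

```python
import numpy as np

def training_target(mel_gt, mel_base):
    """Residual the diffusion model learns to generate (ground truth minus base output)."""
    return mel_gt - mel_base

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.5):
    """Linear noise schedule; the values are illustrative assumptions."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample_residual(eps_model, mel_base, num_steps=4, rng=None):
    """Textbook DDPM ancestral sampling of a residual, conditioned on mel_base."""
    rng = np.random.default_rng(0) if rng is None else rng
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(mel_base.shape)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t, mel_base)  # hypothetical trained noise predictor
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Add noise at every step except the final one
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

def refine(eps_model, mel_base, num_steps=4, rng=None):
    """Add the sampled residual to the base model's mel-spectrogram."""
    return mel_base + sample_residual(eps_model, mel_base, num_steps, rng)

if __name__ == "__main__":
    # Stand-in predictor that always returns zeros; a real model would be trained.
    dummy_eps = lambda x, t, cond: np.zeros_like(x)
    mel_base = np.zeros((80, 100))  # assumed layout: (mel bins, frames)
    refined = refine(dummy_eps, mel_base, num_steps=4)
    print(refined.shape)
```

The base TTS model is never touched in this sketch: it only supplies `mel_base`, which is how the plug-and-play property described in the abstract would show up in code.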