PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model

Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

TL;DR

The paper tackles pitch controllability in diffusion-based neural vocoders by introducing PeriodGrad, which conditions the DDPM reverse process on explicit periodic signals $\boldsymbol{e}$ (sine-based signals together with a voiced/unvoiced (V/UV) signal) to better capture the periodic structure of speech. Building on PriorGrad’s adaptive energy-based prior, PeriodGrad reuses the same training framework and adds periodic conditioning to improve $F_0$ controllability, targeting high-quality 48 kHz singing voice generation. Empirical results show that PeriodGrad achieves better $F_0$ accuracy and MOS than PriorGrad and narrows the gap to PeriodNet, though it does not surpass PeriodNet in every respect, especially under pitch-shifted conditions. The work shows that incorporating explicit periodic information into DDPM-based vocoders can improve pitch controllability and high-frequency synthesis; future directions include broader waveform types and pitch-spectral disentanglement for robustness.
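To make the idea of a sine-based periodic signal with V/UV concrete, here is a minimal sketch of how such a conditioning signal could be built from a frame-level $F_0$ contour. It is an illustration under stated assumptions, not the authors' implementation: the function name `periodic_signal`, the 240-sample hop (5 ms at 48 kHz), the repeat-based upsampling, and the convention that $F_0 = 0$ marks unvoiced frames are all hypothetical choices.

```python
import numpy as np

def periodic_signal(f0_frames, sample_rate=48000, hop_size=240):
    """Return (sine, vuv) at sample rate from frame-level F0 in Hz.

    Assumptions (not from the paper): F0 == 0 marks unvoiced frames,
    and each frame value is held constant for hop_size samples.
    """
    f0_frames = np.asarray(f0_frames, dtype=np.float64)
    # Upsample frame-level F0 to sample level by simple repetition.
    f0 = np.repeat(f0_frames, hop_size)
    # Voiced/unvoiced flag: 1.0 where F0 is positive, else 0.0.
    vuv = (f0 > 0).astype(np.float64)
    # Instantaneous phase is the running sum of per-sample frequency.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    # Sine excitation, silenced in unvoiced regions.
    sine = np.sin(phase) * vuv
    return sine, vuv

# One second of a steady 440 Hz note (200 frames of 240 samples).
sine, vuv = periodic_signal([440.0] * 200)
```

Scaling `f0_frames` by a constant before the call shifts the pitch of the conditioning signal by the same factor, which is the kind of control the pitch-shifted evaluation scenarios probe.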

Abstract

This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary conditioning signals. Recently, DDPM-based neural vocoders have gained prominence as non-autoregressive models that can generate high-quality waveforms. The neural vocoders based on DDPM have the advantage of training with a simple time-domain loss. In practical applications, such as singing voice synthesis, there is a demand for neural vocoders to generate high-fidelity speech waveforms with flexible pitch control. However, conventional DDPM-based neural vocoders struggle to generate speech waveforms under such conditions. Our proposed model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals. Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
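For context, the "simple time-domain loss" can be read as the standard DDPM noise-prediction objective. The form below is a sketch in the usual DDPM notation, not an equation quoted from the paper: $\boldsymbol{x}_0$ (clean waveform), $\boldsymbol{c}$ (acoustic features), $\bar{\alpha}_t$ (cumulative noise schedule), and $\boldsymbol{\epsilon}_\theta$ (denoising network) are assumed symbols, and $\boldsymbol{e}$ denotes the explicit periodic conditioning signal.

$$
L(\theta) = \mathbb{E}_{\boldsymbol{x}_0,\, \boldsymbol{\epsilon},\, t}
\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\!\left( \sqrt{\bar{\alpha}_t}\, \boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon},\; \boldsymbol{c},\; \boldsymbol{e},\; t \right) \right\|_2^2,
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})
$$

PriorGrad-style training replaces the standard-normal noise with noise drawn from a data-dependent prior and weights the error accordingly; that variation is omitted here for brevity.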

Paper Structure

This paper contains 10 sections, 7 equations, and 4 figures.

Figures (4)

  • Figure 1: $F_0$ contours of the natural and generated singing voices.
  • Figure 2: Results of objective evaluation for $F_0$ accuracy.
  • Figure 3: Results of subjective evaluation with 95% confidence intervals. The methods annotated with * have insufficient pitch control performance. These methods are impractical, even if the subjective rating of sound quality could be better.
  • Figure 4: Spectrograms of the natural and generated singing voices.