DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Roi Benita; Michael Elad; Joseph Keshet

TL;DR

The proposed diffusion probabilistic end-to-end model for generating a raw speech waveform is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.

Abstract

Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has focused on generating spectrograms and therefore requires a subsequent model (i.e., a vocoder) to convert the spectrogram to a waveform. This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize speech of unlimited duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which make the overall waveform sound more natural. Furthermore, the proposed diffusion model is stochastic rather than deterministic; therefore, each inference generates a slightly different waveform variation, enabling an abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
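
The overlapping-frame generation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `sample_frame` is a hypothetical stand-in for a trained diffusion sampler (here it only draws Gaussian noise and copies the context into the start of the frame), and the frame length and overlap are assumed values chosen for illustration. The sketch shows only the bookkeeping: each new frame starts from the tail of the previous one, and only the newly generated samples are appended to the output.

```python
import numpy as np


def sample_frame(context, frame_len, rng):
    """Hypothetical placeholder for a diffusion sampler.

    Returns frame_len samples whose first len(context) samples equal the
    context. A real model would run a reverse diffusion process here.
    """
    frame = rng.standard_normal(frame_len)
    frame[: len(context)] = context
    return frame


def generate(num_frames, frame_len=1000, overlap=500, seed=0):
    """Generate a waveform frame by frame with overlapping context.

    frame_len and overlap are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    context = np.zeros(0)  # the first frame has no previous frame
    pieces = []
    for _ in range(num_frames):
        frame = sample_frame(context, frame_len, rng)
        # Keep only the samples that were not already emitted.
        pieces.append(frame[len(context):])
        # The tail of this frame conditions the next one.
        context = frame[frame_len - overlap:]
    return np.concatenate(pieces)


waveform = generate(num_frames=4)
print(waveform.shape)
```

Because every frame after the first contributes `frame_len - overlap` new samples, the output length grows linearly with the number of frames, which is what allows generation of arbitrary duration. Changing `seed` yields a different realization, mirroring the stochasticity noted in the abstract.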

Paper Structure (18 sections, 8 equations, 9 figures, 8 tables)

Introduction
Proposed model
Text representation as linguistic and phonological units
Model architecture
Experiments
Unconditional speech generation
Conditional Speech Generation
Ablation study
Conclusion
Reproducibility
Stochasticity and controllability through the generative process
Vocal Fry
Detailed architecture
Duration predictor
Energy predictor
...and 3 more sections

Figures (9)

Figure 1: The autoregressive model uses part of the previous frame to generate the current frame.
Figure 2: (a) A general overview of the structure of the residual layers and their interconnections. (b) A detailed overview of a single residual layer.
Figure 3: Comparing the energy and pitch of five samples that describe the same text, with the desired energy and pitch values marked in red.
Figure 4: Displaying the vocal fry phenomenon across various models
Figure 5: A detailed overview of a single residual layer
...and 4 more figures
