Neural source-filter waveform models for statistical parametric speech synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi

TL;DR

The paper addresses two bottlenecks in neural vocoding, the slow sequential generation of autoregressive models and the costly training of flow-based models, by proposing a neural source-filter (NSF) framework that combines a sine-based excitation, non-AR dilated-convolution filters, and a spectral-amplitude loss. It introduces three NSF variants (b-NSF, s-NSF, hn-NSF) and shows that hn-NSF achieves speech quality comparable to that of WaveNet while generating waveforms at least 100 times faster and with simpler training. Key insights come from ablation studies demonstrating the necessity of multi-resolution spectral losses, sine-based excitation, and skip connections, as well as analyses showing controllable F0 and interpretable internal dynamics. The approach offers a practical, efficient pathway for high-quality neural vocoding in statistical parametric speech synthesis, with potential for deployment in real-time or resource-constrained settings.

Abstract

Neural waveform models such as WaveNet have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. As an autoregressive (AR) model, WaveNet is limited by a slow sequential waveform generation process. Some new models that use the inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner. However, these IAF-based models require sequential transformation during training, which severely slows down training. Other models such as Parallel WaveNet and ClariNet bring together the benefits of AR and IAF-based models and train an IAF model by transferring the knowledge from a pre-trained AR teacher to an IAF student without any sequential transformation. However, both models require additional training criteria, and their implementation is prohibitively complicated. We propose a framework for neural source-filter (NSF) waveform modeling that uses neither AR nor IAF-based approaches. This framework requires only three components for waveform generation: a source module that generates a sine-based signal as excitation, a non-AR dilated-convolution-based filter module that transforms the excitation into a waveform, and a conditional module that pre-processes the acoustic features for the source and filter modules. This framework minimizes spectral-amplitude distances for model training, which can be efficiently implemented by using short-time Fourier transform routines. Under this framework, we designed three NSF models and compared them with WaveNet. It was demonstrated that the NSF models generated waveforms at least 100 times faster than WaveNet, and the quality of the synthetic speech from the best NSF model was as good as or better than that from WaveNet.
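To make the training criterion in the abstract more concrete, the following is a minimal NumPy sketch of a spectral-amplitude distance computed from short-time Fourier transforms at several resolutions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Hann window, the log-power form of the distance, the `eps` floor, and the three (frame length, hop, DFT size) settings in `CONFIGS` are all choices made here for the example.

```python
import numpy as np

# Illustrative (frame_len, hop, n_fft) settings in samples; these values are
# assumptions chosen for the sketch, not taken from the text above.
CONFIGS = [(320, 80, 512), (80, 40, 128), (1920, 640, 2048)]


def stft_amplitude(x, frame_len, hop, n_fft):
    """Return amplitude spectra of Hann-windowed frames.

    Output shape is (n_frames, n_fft // 2 + 1). Frames shorter than n_fft
    are zero-padded by the FFT routine.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def log_spectral_distance(generated, natural, frame_len, hop, n_fft, eps=1e-5):
    """Mean squared difference between log power spectra of two waveforms."""
    amp_gen = stft_amplitude(generated, frame_len, hop, n_fft)
    amp_nat = stft_amplitude(natural, frame_len, hop, n_fft)
    diff = np.log(amp_nat**2 + eps) - np.log(amp_gen**2 + eps)
    return 0.5 * np.mean(diff**2)


def multi_resolution_distance(generated, natural, configs=CONFIGS):
    """Sum the log-spectral distance over several STFT resolutions."""
    return sum(
        log_spectral_distance(generated, natural, *cfg) for cfg in configs
    )
```

In actual training the same computation would be written in a framework with automatic differentiation so that gradients can flow back to the generated waveform; the NumPy version only shows the forward calculation. Because the distance depends on amplitudes alone, it ignores phase differences between the generated and natural waveforms.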

Paper Structure

This paper contains 28 sections, 12 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Three types of neural waveform models in training stage. $\widehat{\boldsymbol{o}}_{1:T}$ and $\boldsymbol{o}_{1:T}$ denote generated and natural waveforms, respectively. $\boldsymbol{c}_{1:B}$ denotes input acoustic features. Red arrows denote gradients for back propagation.
  • Figure 2: Illustration of calculating three spectral distances for forward (black) and backward (red) propagation. DFT denotes discrete Fourier transform. Vectors $\widehat{\boldsymbol{x}}^{(n)}$, $\widehat{\boldsymbol{y}}^{(n)}$, and $\widehat{\boldsymbol{g}}^{(n)}$ denote windowed waveform, spectrum, and composed gradient vector for $n$-th frame.
  • Figure 3: Structure of baseline NSF (b-NSF) model. $B$ and $T$ denote lengths of input feature sequence and output waveform, respectively. FF, CONV, and Bi-LSTM denote feedforward, convolutional, and bi-directional recurrent layers, respectively. Structure of dilated-CONV filter block is plotted in Figure 4.
  • Figure 4: Structure of dilated-CONV-based filter blocks for b-NSF model (top) and simplified NSF (s-NSF) model (bottom). $\boldsymbol{v}_{1:T}^{\text{in}}$ and $\boldsymbol{v}_{1:T}^{\text{out}}$ denote input and output of one filter block, respectively. $\odot$ denotes element-wise product. Every filter block contains 10 dilated-convolution layers, and every CONV and FF layer uses the tanh activation function.
  • Figure 5: Example of F0 sequence $\boldsymbol{f}_{1:T}$ and fundamental component $\boldsymbol{e}_{1:T}^{<0>}$
  • ...and 7 more figures
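
The relation between an F0 sequence and a sine-based fundamental component, as in the Figure 5 caption, can be sketched as follows. This is a hedged illustration rather than the paper's exact source module: the function name `sine_excitation`, the 16 kHz sampling rate, the amplitude of 0.1, the noise standard deviation of 0.003, the random initial phase, and the noise-only handling of unvoiced samples (F0 = 0) are assumptions made for this example.

```python
import numpy as np


def sine_excitation(f0, sample_rate=16000, amplitude=0.1, noise_std=0.003, seed=0):
    """Build a sine-based excitation from a per-sample F0 track.

    f0 holds one F0 value in Hz per waveform sample; 0 marks unvoiced samples.
    All default parameter values are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0

    # Instantaneous phase is the running sum of per-sample phase increments,
    # so the sine frequency follows the F0 contour without discontinuities.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    initial_phase = rng.uniform(-np.pi, np.pi)

    noise = rng.normal(0.0, noise_std, size=f0.shape)
    voiced_signal = amplitude * np.sin(phase + initial_phase) + noise

    # Unvoiced samples carry only noise, rescaled so that its 3-sigma range
    # roughly matches the sine amplitude.
    unvoiced_signal = amplitude / (3.0 * noise_std) * noise

    return np.where(voiced, voiced_signal, unvoiced_signal)
```

For a constant F0 of 200 Hz over one second at 16 kHz, the magnitude spectrum of the returned signal peaks at the 200 Hz bin, which is a quick way to check that the phase accumulation is correct.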