Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Yang Ai, Zhen-Hua Ling

TL;DR

A neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. According to the authors, it is the first to do so using neural networks alone.

Abstract

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. It consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
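The two ideas in the abstract that are easiest to misread, the phase calculation from two parallel outputs and the anti-wrapping losses, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The function names are hypothetical. The phase calculation is assumed to behave like a two-argument arctangent. The anti-wrapping function is assumed to be the linear form |x - 2π·round(x / 2π)|, which is one function that is even, 2π-periodic, and increasing on [0, π]. The group delay and instantaneous angular frequency terms are approximated with plain finite differences along the frequency and time axes.

```python
import numpy as np


def phase_from_parts(real, imag):
    """Wrapped phase from pseudo real and imaginary parts.

    Stands in for the phase calculation formula applied to the outputs of
    the two parallel layers (assumption: equivalent to arctan2).
    np.arctan2 returns values in [-pi, pi].
    """
    return np.arctan2(imag, real)


def anti_wrap(x):
    """Assumed linear anti-wrapping function.

    Even in x, periodic with period 2*pi, and increasing on [0, pi],
    so errors that differ by a multiple of 2*pi are treated as equal.
    """
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))


def anti_wrapping_losses(pred, target):
    """Return (instantaneous phase, group delay, angular frequency) losses.

    pred, target: wrapped phase arrays shaped (freq_bins, frames).
    """
    # Instantaneous phase error, activated by the anti-wrapping function.
    ip = anti_wrap(pred - target).mean()
    # Group delay error: finite difference along the frequency axis.
    gd = anti_wrap(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    # Instantaneous angular frequency error: difference along the time axis.
    iaf = anti_wrap(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    return ip, gd, iaf


if __name__ == "__main__":
    # Two phases close to +pi and -pi are nearly the same angle, but their
    # raw absolute difference is large. This is the error expansion issue.
    predicted, natural = np.array(3.1), np.array(-3.1)
    print("raw error:         ", float(np.abs(predicted - natural)))
    print("anti-wrapped error:", float(anti_wrap(predicted - natural)))
```

Running the script contrasts the raw absolute error with the anti-wrapped error for two phase values that sit on opposite sides of the wrapping boundary. The anti-wrapped value reflects the true angular distance.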

Paper Structure (22 sections, 25 equations, 11 figures, 7 tables)

This paper contains 22 sections, 25 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Details of the proposed neural speech phase prediction model. Here, RCNet, CONV, STFT, DF, DT, Re, Im and $\Phi$ represent the residual convolutional network, linear convolutional layer, short-time Fourier transform, differential along frequency axis, differential along time axis, real part calculation, imaginary part calculation and phase calculation formula, respectively. Gray parts do not appear during generation.
  • Figure 2: Details of the residual convolutional network and the training procedure of the low-latency streamable neural speech phase prediction model through knowledge distillation. Here, subfigure (a) represents a non-causal teacher model which is consistent with Figure 1. Subfigure (b) represents a causal student model. RCNet, CONV, DCONV and $\Phi$ represent the residual convolutional network, linear convolutional layer, linear dilated convolutional layer and phase calculation formula, respectively. $k_*$ and $d_{*,*}$ denote kernel size and dilation factor, respectively.
  • Figure 3: An illustration of the error expansion issue caused by phase wrapping.
  • Figure 4: Graphs of five typical anti-wrapping functions, including (a) linear function; (b) logarithmic function; (c) cubic function; (d) parabolic function and (e) cosine function.
  • Figure 5: A simple flowchart of the analysis-synthesis task, BWE task and SS task. Here, Concat and ISTFT represent concatenation and inverse short-time Fourier transform, respectively.
  • ...and 6 more figures