Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Yang Ai; Zhen-Hua Ling

TL;DR

A novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra using only neural networks; to the authors' knowledge, it is the first model to do so.

Abstract

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. It consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and the natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions with a knowledge distillation training strategy. For both the analysis-synthesis task and specific speech generation tasks, experimental results show that the proposed model outperforms iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, the proposed model also shows a clear efficiency advantage while maintaining the quality of the synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra using only neural networks.
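The two ideas at the centre of the abstract, the phase calculation formula and the anti-wrapping losses, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the phase calculation formula behaves like the standard two-argument arctangent, that the anti-wrapping function is a linear one (the distance to the nearest multiple of 2π, which is even, 2π-periodic and monotonic on [0, π]), and that the group delay and instantaneous angular frequency are first-order differences along the frequency and time axes. The function names `phase_from_parts`, `anti_wrap` and `anti_wrapping_losses` are hypothetical.

```python
import numpy as np


def phase_from_parts(real, imag):
    """Wrapped phase from pseudo real and imaginary parts.

    Assumption: np.arctan2 stands in for the paper's phase calculation
    formula; it returns values in the principal interval [-pi, pi].
    """
    return np.arctan2(imag, real)


def anti_wrap(x):
    """Assumed linear anti-wrapping function.

    Returns the absolute distance from x to the nearest multiple of 2*pi,
    so it is even, 2*pi-periodic and increasing on [0, pi].
    """
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))


def anti_wrapping_losses(pred, target):
    """Anti-wrapped mean errors for phases of shape (freq_bins, frames).

    Returns (instantaneous phase, group delay, instantaneous angular
    frequency) losses; the differences are assumed to be first-order.
    """
    ip = anti_wrap(pred - target).mean()
    # Group delay: difference along the frequency axis (axis 0).
    gd = anti_wrap(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    # Instantaneous angular frequency: difference along the time axis (axis 1).
    iaf = anti_wrap(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    return ip, gd, iaf


if __name__ == "__main__":
    # Two phases that are nearly identical on the circle but far apart
    # numerically because of wrapping.
    pred = np.pi - 0.01
    target = -np.pi + 0.01
    print("raw error:", abs(pred - target))
    print("anti-wrapped error:", anti_wrap(pred - target))
```

In the small example at the bottom, the raw absolute error is close to 2π even though the two phases differ by only about 0.02 rad on the unit circle. This is the error expansion issue that the anti-wrapping function is meant to remove.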

Paper Structure

This paper contains 22 sections, 25 equations, 11 figures, 7 tables.

Introduction
Related Works
Griffin-Lim Algorithm (GLA)
Relaxed Averaged Alternating Reflection (RAAR)
Von Mises Distribution DNN-based Method
Proposed Method
Model Structure
Training Criteria
Low-Latency Streamable Phase Prediction by Causal Convolution and Knowledge Distillation
Experiments
Data and Feature Configuration
Speech Generation Tasks
Analysis-Synthesis Task
BWE Task
SS Task
...and 7 more sections

Figures (11)

Figure 1: Details of the proposed neural speech phase prediction model. Here, RCNet, CONV, STFT, DF, DT, Re, Im and $\Phi$ represent the residual convolutional network, linear convolutional layer, short-time Fourier transform, differential along frequency axis, differential along time axis, real part calculation, imaginary part calculation and phase calculation formula, respectively. Gray parts do not appear during generation.
Figure 2: Details of the residual convolutional network and the training procedure of the low-latency streamable neural speech phase prediction model through knowledge distillation. Here, subfigure (a) represents a non-causal teacher model which is consistent with Figure 1. Subfigure (b) represents a causal student model. RCNet, CONV, DCONV and $\Phi$ represent the residual convolutional network, linear convolutional layer, linear dilated convolutional layer and phase calculation formula, respectively. $k_*$ and $d_{*,*}$ denote the kernel size and dilation factor, respectively.
Figure 3: An illustration of the error expansion issue caused by phase wrapping.
Figure 4: Graphs of five typical anti-wrapping functions, including (a) linear function; (b) logarithmic function; (c) cubic function; (d) parabolic function and (e) cosine function.
Figure 5: A simple flowchart of the analysis-synthesis task, BWE task and SS task. Here, Concat and ISTFT represent concatenation and inverse short-time Fourier transform, respectively.
...and 6 more figures
