Stage-Wise and Prior-Aware Neural Speech Phase Prediction

Fei Liu; Yang Ai; Hui-Peng Du; Ye-Xin Lu; Rui-Chen Zheng; Zhen-Hua Ling

TL;DR

The paper tackles speech phase prediction with SP-NSPP, a stage-wise framework that first predicts a coarse prior phase spectrum from the amplitude spectrum and then refines it conditioned on that prior. Both stages use ConvNeXt v2 backbones and adversarial training with a phase spectrum discriminator (PSD), and the refinement stage adds a time-frequency integrated difference (TFID) loss to enforce phase continuity. The result is higher phase prediction accuracy and better speech quality than no-prior neural methods, with more efficient generation than iterative methods. Key contributions are the explicit prior-construction stage, the prior-conditioned refinement stage, adversarial phase training, and the TFID continuity loss, validated on VCTK with strong generalization to higher sampling rates and non-speech data. Because it improves synthesis quality while reducing computational cost, the approach is practical for real-time speech generation tasks.

Abstract

This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from an input amplitude spectrum using two-stage neural networks. In the initial prior-construction stage, we predict a rough prior phase spectrum from the amplitude spectrum. The subsequent refinement stage transforms the amplitude spectrum into a refined, high-quality phase spectrum conditioned on the prior phase. The networks in both stages use ConvNeXt v2 blocks as the backbone and adopt adversarial training by introducing a phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, we also incorporate a time-frequency integrated difference (TFID) loss in the refinement stage. Experimental results confirm that, compared to neural network-based no-prior phase prediction methods, the proposed SP-NSPP achieves higher phase prediction accuracy, thanks to the coarse phase priors and the diverse training criteria. Compared to iterative phase estimation algorithms, the proposed SP-NSPP does not require multiple rounds of staged iterations, resulting in higher generation efficiency.
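The abstract names a TFID loss for phase continuity but does not give its formula. As a rough illustration only, the sketch below shows one common way to penalize mismatched phase differences along the time and frequency axes while ignoring 2π wrap-around. The function names `anti_wrap` and `difference_continuity_loss`, the use of first-order differences, and the equal weighting of the two terms are all assumptions for illustration, not the paper's definition of TFID.

```python
import numpy as np

def anti_wrap(x):
    """Fold a phase error into [0, pi] by removing whole multiples of 2*pi."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def difference_continuity_loss(pred, target):
    """Hypothetical stand-in for a time-frequency difference loss.

    pred, target: phase arrays of shape (freq_bins, frames), in radians.
    NOTE: this is an assumed form, not the TFID loss from the paper.
    """
    # Mismatch of first-order differences along the frequency axis (axis 0).
    d_freq = np.diff(pred, axis=0) - np.diff(target, axis=0)
    # Mismatch of first-order differences along the time axis (axis 1).
    d_time = np.diff(pred, axis=1) - np.diff(target, axis=1)
    return anti_wrap(d_freq).mean() + anti_wrap(d_time).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.uniform(-np.pi, np.pi, size=(5, 7))
    # Shifting any bin by a whole number of turns leaves the phase unchanged,
    # so the loss should be numerically zero.
    turns = rng.integers(-2, 3, size=target.shape)
    print(difference_continuity_loss(target + 2.0 * np.pi * turns, target))
```

The anti-wrapping step matters because two phase values that differ by a multiple of 2π describe the same signal, so a plain absolute or squared error would penalize predictions that are in fact correct.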

Paper Structure (18 sections, 13 equations, 5 figures, 4 tables)

This paper contains 18 sections, 13 equations, 5 figures, 4 tables.

Introduction
Proposed Method
Overview
Model Structure
Training Criteria
Training Criteria of Prior Construction Model
Training Criteria of Refinement Model
Optional Iterative Prediction Mode
Experiments and Results
Data and Feature Configuration
Task Definitions
Model Details
Evaluation Metrics
Primary Experimental Results
Ablation Studies
...and 3 more sections

Figures (5)

Figure 1: Details of the model structure of the proposed SP-NSPP. Here, Conv1D, LN, PEA, GELU, GRN and $\Phi$ represent the 1D convolutional layer, layer normalization layer, parallel estimation architecture, Gaussian error linear unit, global response normalization and phase calculation formula, respectively.
Figure 2: Details of the training losses of the proposed SP-NSPP. Here, Conv2D and LRELU represent the 2D convolutional layer and leaky rectified linear unit, respectively.
Figure 3: A comparison of the spectrograms (0$\sim$4 kHz) of natural speech and of speech generated by NSPP and SP-NSPP for the analysis-synthesis task.
Figure 4: A comparison of the spectrograms (0$\sim$4 kHz) of natural speech and of speech generated by SP-NSPP and SP-NSPP w/o PSD for the analysis-synthesis task.
Figure 5: Curves of PESQ and model size of the SP-NSPP as a function of the number of iterations for the analysis-synthesis task.
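The Figure 1 caption mentions a parallel estimation architecture (PEA) feeding a phase calculation formula $\Phi$, but this page does not spell the formula out. The sketch below assumes $\Phi$ is the two-argument arctangent applied to two parallel network outputs, treated as pseudo-real and pseudo-imaginary parts; that reading and the names `phase_from_parallel_outputs`, `pseudo_real`, and `pseudo_imag` are assumptions for illustration.

```python
import numpy as np

def phase_from_parallel_outputs(pseudo_real, pseudo_imag):
    """Assumed form of the phase calculation: atan2(imag, real).

    Both inputs are arrays of the same shape (e.g. freq_bins x frames).
    The result is in radians and lies in [-pi, pi].
    """
    return np.arctan2(pseudo_imag, pseudo_real)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = rng.normal(size=(4, 3))
    i = rng.normal(size=(4, 3))
    phase = phase_from_parallel_outputs(r, i)
    # Only the direction of (r, i) matters: a positive rescaling of both
    # outputs leaves the phase unchanged.
    print(np.allclose(phase, phase_from_parallel_outputs(2.5 * r, 2.5 * i)))
```

Under this assumption the network never has to emit a wrapped value directly: any pair of real-valued outputs maps to a valid principal-value phase, and the magnitude of the pair is discarded.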
