Stage-Wise and Prior-Aware Neural Speech Phase Prediction

Fei Liu, Yang Ai, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, Zhen-Hua Ling

TL;DR

The paper tackles speech phase prediction by introducing SP-NSPP, a stage-wise framework that first generates a coarse prior phase from the amplitude spectrum and then refines it conditioned on that prior. The model uses ConvNeXt v2 backbones, adversarial training with a phase spectrum discriminator (PSD), and a time-frequency integrated difference (TFID) loss that enforces phase continuity. Key contributions are the explicit prior-construction stage, the prior-conditioned refinement stage, adversarial phase training, and the TFID-based continuity loss, validated on VCTK with strong generalization to higher sampling rates and non-speech data. Relative to no-prior neural phase predictors, the approach improves phase accuracy and speech quality; relative to iterative methods, it reduces computational cost, which makes it practical for real-time speech generation tasks.

Abstract

This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from the input amplitude spectrum using two-stage neural networks. In the initial prior-construction stage, we predict a rough prior phase spectrum from the amplitude spectrum. The subsequent refinement stage transforms the amplitude spectrum into a refined, high-quality phase spectrum conditioned on the prior phase. The networks in both stages use ConvNeXt v2 blocks as the backbone and adopt adversarial training by introducing a phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, we also incorporate a time-frequency integrated difference (TFID) loss in the refinement stage. Experimental results confirm that, compared to neural network-based no-prior phase prediction methods, the proposed SP-NSPP achieves higher phase prediction accuracy, thanks to the coarse phase prior and the diverse training criteria. Compared to iterative phase estimation algorithms, SP-NSPP does not require multiple rounds of iteration, resulting in higher generation efficiency.
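
The task described in the abstract, finding a phase spectrum to pair with a known amplitude spectrum, can be illustrated with a small NumPy sketch. It is not the SP-NSPP model: the arrays are random toy data, and the helper names `combine` and `wrap` are hypothetical. It only shows how an amplitude and a phase spectrum form a complex spectrum, and that a phase is meaningful only up to multiples of $2\pi$.

```python
import numpy as np


def combine(amplitude, phase):
    """Build a complex spectrum from amplitude and phase arrays of equal shape."""
    return amplitude * np.exp(1j * phase)


def wrap(phase):
    """Wrap angles (in radians) into the principal interval [-pi, pi)."""
    return (phase + np.pi) % (2.0 * np.pi) - np.pi


# Toy data standing in for an amplitude spectrum and a predicted phase
# spectrum (frequency bins x frames). Random values, for illustration only.
rng = np.random.default_rng(0)
amplitude = rng.uniform(0.1, 1.0, size=(5, 4))
phase = rng.uniform(-np.pi, np.pi, size=(5, 4))

spectrum = combine(amplitude, phase)

# Adding a full turn to every phase value leaves the complex spectrum unchanged.
shifted = combine(amplitude, phase + 2.0 * np.pi)
```

Because of this periodicity, a phase error is usually measured on wrapped differences rather than on raw differences.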

Paper Structure (18 sections, 13 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Details of the model structure of the proposed SP-NSPP. Here, Conv1D, LN, PEA, GELU, GRN and $\Phi$ represent the 1D convolutional layer, layer normalization layer, parallel estimation architecture, Gaussian error linear unit, global response normalization and phase calculation formula, respectively.
  • Figure 2: Details of the training losses of the proposed SP-NSPP. Here, Conv2D and LRELU represent the 2D convolutional layer and leaky rectified linear unit, respectively.
  • Figure 3: A comparison among the spectrograms (0$\sim$4 kHz) of the natural speech and the speech generated by NSPP and SP-NSPP for the analysis-synthesis task.
  • Figure 4: A comparison among the spectrograms (0$\sim$4 kHz) of the natural speech and the speech generated by SP-NSPP and SP-NSPP w/o PSD for the analysis-synthesis task.
  • Figure 5: Curves of PESQ and model size of the SP-NSPP as a function of the number of iterations for the analysis-synthesis task.
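
The caption of Figure 1 mentions a parallel estimation architecture (PEA) followed by a phase calculation formula $\Phi$. A minimal sketch of that last step is given below, assuming that $\Phi$ is the two-argument arctangent applied to two parallel real-valued outputs. The function name and the names `pseudo_real` and `pseudo_imag` are assumptions made for illustration; this summary does not define them.

```python
import numpy as np


def phase_from_parallel_outputs(pseudo_real, pseudo_imag):
    """Map two parallel real-valued outputs to a phase in [-pi, pi].

    Assumption: the phase calculation formula is the two-argument
    arctangent, which resolves the quadrant from the signs of both inputs.
    """
    return np.arctan2(pseudo_imag, pseudo_real)


# Hand-picked toy inputs, one per axis direction.
pseudo_real = np.array([1.0, 0.0, -1.0, 0.0])
pseudo_imag = np.array([0.0, 1.0, 0.0, -1.0])

phase = phase_from_parallel_outputs(pseudo_real, pseudo_imag)

# Scaling both inputs by the same positive factor does not change the phase,
# so the result is always a valid wrapped angle regardless of output magnitude.
scaled_phase = phase_from_parallel_outputs(3.0 * pseudo_real, 3.0 * pseudo_imag)
```

One appeal of this kind of design is that the network outputs are unconstrained real numbers, while the value passed on is always a wrapped angle.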