Stage-Wise and Prior-Aware Neural Speech Phase Prediction
Fei Liu, Yang Ai, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, Zhen-Hua Ling
TL;DR
The paper addresses speech phase prediction with SP-NSPP, a stage-wise framework that first predicts a coarse prior phase spectrum from the amplitude spectrum and then predicts a refined phase spectrum from the amplitude spectrum conditioned on that prior. Both stages use ConvNeXt v2 backbones and adversarial training with a phase spectrum discriminator (PSD), and the refinement stage adds a time-frequency integrated difference (TFID) loss to enforce phase continuity. The model achieves higher phase accuracy and better speech quality, with more efficient generation than iterative methods. Key contributions are the explicit prior-construction stage, the prior-conditioned refinement stage, adversarial phase training, and the TFID-based continuity loss, validated on VCTK with strong generalization to higher sampling rates and non-speech data. The approach improves synthesis quality while reducing computational cost, which makes it practical for real-time speech generation tasks.
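The stage-wise data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the ConvNeXt v2 networks are replaced by a hypothetical `toy_phase_net` (two linear maps followed by `arctan2`), and feeding the prior to the second stage as concatenated cosine and sine features is an assumption made only to show what "conditioned on the prior" could look like.

```python
import numpy as np

def toy_phase_net(features, weights_r, weights_i):
    """Hypothetical stand-in for a phase network (NOT the paper's model).

    Two linear maps produce pseudo real and imaginary parts; arctan2
    converts them into a phase value in [-pi, pi].
    """
    pseudo_real = weights_r @ features
    pseudo_imag = weights_i @ features
    return np.arctan2(pseudo_imag, pseudo_real)

def predict_phase(log_amp, params):
    """Two-stage prediction for a (freq_bins, frames) log-amplitude spectrum."""
    # Stage 1 (prior construction): coarse phase from the amplitude alone.
    prior = toy_phase_net(log_amp, *params["prior"])
    # Stage 2 (refinement): amplitude plus the prior as conditioning input.
    # Assumption: the prior is passed as cos/sin features to avoid wrapping jumps.
    cond = np.concatenate([log_amp, np.cos(prior), np.sin(prior)], axis=0)
    refined = toy_phase_net(cond, *params["refine"])
    return prior, refined

def to_complex_spectrum(log_amp, phase):
    """Combine amplitude and predicted phase into a complex spectrum."""
    return np.exp(log_amp) * np.exp(1j * phase)

# Tiny demo with random weights (illustrative shapes only).
rng = np.random.default_rng(0)
n_bins, n_frames = 6, 4
demo_log_amp = rng.normal(size=(n_bins, n_frames))
demo_params = {
    "prior": (rng.normal(size=(n_bins, n_bins)), rng.normal(size=(n_bins, n_bins))),
    "refine": (rng.normal(size=(n_bins, 3 * n_bins)), rng.normal(size=(n_bins, 3 * n_bins))),
}
demo_prior, demo_refined = predict_phase(demo_log_amp, demo_params)
demo_spec = to_complex_spectrum(demo_log_amp, demo_refined)
```

Both stages run once per utterance, so the cost is two forward passes rather than a loop of repeated estimates. The complex spectrum returned by `to_complex_spectrum` is what an inverse short-time Fourier transform would turn back into a waveform.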
Abstract
This paper proposes a novel Stage-wise and Prior-aware Neural Speech Phase Prediction (SP-NSPP) model, which predicts the phase spectrum from an input amplitude spectrum using two-stage neural networks. In the initial prior-construction stage, we predict a rough prior phase spectrum from the amplitude spectrum. The subsequent refinement stage transforms the amplitude spectrum into a refined, high-quality phase spectrum conditioned on the prior phase. The networks in both stages use ConvNeXt v2 blocks as the backbone and are trained adversarially with a newly introduced phase spectrum discriminator (PSD). To further improve the continuity of the refined phase, we also incorporate a time-frequency integrated difference (TFID) loss in the refinement stage. Experimental results confirm that, compared to neural network-based no-prior phase prediction methods, the proposed SP-NSPP achieves higher phase prediction accuracy, thanks to the coarse phase priors and the diverse training criteria. Compared to iterative phase estimation algorithms, SP-NSPP does not require multiple rounds of staged iterations, resulting in higher generation efficiency.
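The abstract does not give the TFID formula, so the following is only a toy illustration of what a wrapping-aware continuity penalty on time and frequency differences might look like. The function names `anti_wrap` and `tf_difference_loss`, the equal weighting of the two terms, and the use of first-order differences are all assumptions, not the paper's definition.

```python
import numpy as np

def anti_wrap(x):
    """Absolute angular distance |x - 2*pi*round(x / (2*pi))|, in [0, pi].

    Phase is only defined modulo 2*pi, so errors must be compared
    on the circle rather than on the real line.
    """
    return np.abs(x - 2 * np.pi * np.round(x / (2 * np.pi)))

def tf_difference_loss(pred, target):
    """Toy continuity penalty for (freq_bins, frames) phase spectra.

    Assumption: compare first-order differences along frequency (axis 0)
    and time (axis 1), each through the anti-wrapping distance.
    """
    freq_err = np.diff(pred, axis=0) - np.diff(target, axis=0)
    time_err = np.diff(pred, axis=1) - np.diff(target, axis=1)
    return anti_wrap(freq_err).mean() + anti_wrap(time_err).mean()

# Demo: adding integer multiples of 2*pi leaves the phase unchanged,
# so the penalty stays (numerically) at zero.
rng = np.random.default_rng(0)
target_phase = rng.uniform(-np.pi, np.pi, size=(5, 7))
shifted_phase = target_phase + 2 * np.pi * rng.integers(-2, 3, size=(5, 7))
loss_equivalent = tf_difference_loss(shifted_phase, target_phase)
loss_random = tf_difference_loss(rng.uniform(-np.pi, np.pi, size=(5, 7)), target_phase)
```

A penalty of this form ignores a constant phase offset and any 2π jumps, and responds only to mismatches in how the phase changes between neighbouring frames and frequency bins. That is the sense of "continuity" the sketch aims to show.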
