Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

TL;DR

This work addresses the role of phase in speech enhancement by introducing MP-SENet, a TF-domain encoder–decoder network that explicitly estimates the magnitude and wrapped phase spectra in parallel. It uses TF-Transformer blocks to capture time and frequency dependencies, and combines parallel magnitude and phase decoders with anti-wrapping phase losses and STFT consistency to mitigate magnitude–phase compensation. A MetricGAN-based discriminator guides perceptual quality, and the model reaches state-of-the-art performance on denoising, dereverberation, and bandwidth extension within a unified framework. The results show improved perceptual quality and harmonic integrity, highlighting the practical value of explicit phase modeling for high-quality speech enhancement.
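To illustrate why an anti-wrapping phase loss matters, here is a minimal sketch. It assumes the anti-wrapping function takes the common form f(x) = |x - 2π·round(x / 2π)|; this summary does not give the formula, so treat that form, and the helper names `anti_wrap`, `naive_l1`, and `anti_wrapped_l1`, as illustrative assumptions rather than the authors' exact losses.

```python
import math

TWO_PI = 2.0 * math.pi


def anti_wrap(x: float) -> float:
    """Fold a phase difference into [0, pi] (assumed anti-wrapping form)."""
    return abs(x - TWO_PI * round(x / TWO_PI))


def naive_l1(pred, target):
    """Plain L1 distance, which ignores that phase is periodic."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def anti_wrapped_l1(pred, target):
    """L1 distance after anti-wrapping each phase difference."""
    return sum(anti_wrap(p - t) for p, t in zip(pred, target)) / len(pred)


# Two phases that sit on opposite sides of the +/- pi boundary are
# almost identical as angles, but far apart as raw numbers.
pred = [3.1, -3.1]
target = [-3.1, 3.1]

print(naive_l1(pred, target))         # large: raw values differ by 6.2
print(anti_wrapped_l1(pred, target))  # small: angles differ by about 0.083 rad
```

The naive distance penalizes a prediction that is nearly correct, while the anti-wrapped distance reflects the true angular error.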

Abstract

Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are then fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that our proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase by explicit phase estimation, elevating the perceptual quality of enhanced speech.
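The parallel magnitude and phase paths described above can be sketched for a single time-frequency bin. This is a simplified illustration, not the authors' implementation: the mask and the two phase-branch outputs are passed in as plain numbers instead of coming from a network, the compression exponent `C_COMPRESS = 0.3` is an assumed illustrative value, and the function names are hypothetical.

```python
import cmath
import math

C_COMPRESS = 0.3  # assumed compression exponent c, for illustration only


def decompose(z: complex, c: float = C_COMPRESS):
    """Split a complex STFT bin into compressed magnitude and wrapped phase."""
    return abs(z) ** c, cmath.phase(z)


def enhance_bin(noisy: complex, mask: float,
                pseudo_real: float, pseudo_imag: float,
                c: float = C_COMPRESS) -> complex:
    """Combine a masked magnitude and an explicitly estimated phase.

    mask         -- stand-in for the magnitude mask decoder output
    pseudo_real  -- stand-in for one output of the phase decoder
    pseudo_imag  -- stand-in for the other output of the phase decoder
    """
    mag_c, _ = decompose(noisy, c)
    # Magnitude path: mask the compressed magnitude, then decompress.
    enhanced_mag = (mag_c * mask) ** (1.0 / c)
    # Phase path: the two-argument arctangent yields a wrapped phase
    # in (-pi, pi] directly from the two decoder outputs.
    enhanced_phase = math.atan2(pseudo_imag, pseudo_real)
    return cmath.rect(enhanced_mag, enhanced_phase)


noisy = complex(2.0, 2.0)
# An all-pass mask and a phase estimate of pi/4 reproduce the input bin.
print(enhance_bin(noisy, mask=1.0, pseudo_real=1.0, pseudo_imag=1.0))
```

Because the phase is produced by its own branch, the magnitude mask does not have to absorb phase errors, which is the compensation effect the abstract refers to.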

Paper Structure (37 sections, 6 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Overall structure of the proposed MP-SENet for three SE tasks, with the denoising task used as an illustration. The "$Abs( \cdot )$" and "$Angle( \cdot )$" represent the magnitude and phase calculation functions, respectively. The "$( \cdot )^c$" and "$( \cdot )^{1/c}$" denote the magnitude compression and decompression operations, respectively. The "Concat" denotes the spectral concatenation operation and the "Arctan2" denotes the two-argument arc-tangent function.
  • Figure 2: The diagram of the TF-Transformer block used in Fig. 1, where $B$, $C$, $T$, and $F'$ represent the batch size, the number of channels, the number of frames, and the number of frequency bins, respectively. The TF-Transformer block is a cascade of a Time-Transformer and a Freq.-Transformer, both of which share the same architecture with inputs of different shapes.
  • Figure 3: Spectrogram visualization of the noisy speech, clean speech, and speech waveforms denoised by SOTA baseline methods and our proposed MP-SENet.
  • Figure 4: WB-PESQ, PD, and ViSQOL metrics of the noisy speech and speech waveforms enhanced by DB-AIAT, CMGAN, and our proposed MP-SENet on the re-noised Voice Bank test set with varying SNR conditions.
  • Figure 5: Spectrogram visualization of the noisy speech, clean speech, and speech waveforms enhanced by our proposed MP-SENet with different combinations of phase optimization approaches on the VoiceBank+DEMAND dataset.
  • ...and 2 more figures
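
The Figure 2 caption says the Time-Transformer and Freq.-Transformer share one architecture but receive inputs of different shapes. One plausible reading, sketched below with nested lists, is that a feature map of shape [B, C, T, F'] is viewed as B·F' sequences of length T for the time pass and as B·T sequences of length F' for the frequency pass. The exact layout is an assumption, and `time_view` and `freq_view` are hypothetical helper names.

```python
def time_view(x):
    """[B][C][T][F] -> [B*F][T][C]: one sequence over frames per frequency bin."""
    B, C, T, F = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[x[b][c][t][f] for c in range(C)] for t in range(T)]
            for b in range(B) for f in range(F)]


def freq_view(x):
    """[B][C][T][F] -> [B*T][F][C]: one sequence over frequency bins per frame."""
    B, C, T, F = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[x[b][c][t][f] for c in range(C)] for f in range(F)]
            for b in range(B) for t in range(T)]


# Toy feature map with B=1, C=2, T=3, F=4; each entry records its own
# (channel, frame, bin) index so the rearrangement is easy to inspect.
B, C, T, F = 1, 2, 3, 4
x = [[[[(c, t, f) for f in range(F)] for t in range(T)] for c in range(C)]
     for _ in range(B)]

tv = time_view(x)
fv = freq_view(x)
print(len(tv), len(tv[0]), len(tv[0][0]))  # sequences, length, channels
print(len(fv), len(fv[0]), len(fv[0][0]))
```

Under this layout the same sequence model can attend along time in one pass and along frequency in the next, with the channel axis serving as the feature dimension in both.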