Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting

Alimjan Mattursun, Liejun Wang, Yinfeng Yu, Chunyang Ma

TL;DR

BSP-MPNet tackles speech enhancement under challenging noisy and reverberant conditions by integrating self-supervised embeddings with magnitude-phase information in a dual-path framework. The method introduces a Magnitude-Phase 2D Coarse (MP-2DC) encoder, a Feature-Separated SSL (FS-SSL) module that splits magnitude and phase features, and an RNN-Enhanced Multi-Attention (REMA) mask decoder with Time-Frequency Attention that reconstructs the speech. Key contributions include (i) cross-domain fusion of FS-SSL features with PCS-boosted spectra, (ii) decoupled magnitude and phase SSL features with adaptive weighting, and (iii) a multi-component masking mechanism that improves both magnitude and phase recovery. Experiments on VoiceBank+DEMAND and WHAMR! show BSP-MPNet achieving superior or competitive performance across PESQ, STOI, SI-SNR, and DNSMOS, underscoring the value of phase-aware SSL integration for SE and suggesting distillation as a route to better efficiency in future work.

Abstract

Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The BSP-MPNet implementation is available online at https://github.com/AlimMat/BSP-MPNet.
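The abstract describes a pipeline that treats magnitude and phase as separate paths and reconstructs speech by applying masks to them. The sketch below illustrates only that general idea on a single frame, using the Python standard library: it splits a complex spectrum into magnitude and phase, applies a multiplicative magnitude mask and an additive phase correction, and inverts the result. The function names (`dft`, `idft`, `split_mag_phase`, `apply_masks`) and the mask forms are assumptions chosen for illustration; they are not taken from the BSP-MPNet code, and a practical system would use a windowed STFT over many frames rather than a naive single-frame DFT.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; keeps the real part, which assumes the
    spectrum is (close to) conjugate-symmetric."""
    n = len(spectrum)
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n)
             for k, c in enumerate(spectrum)) / n).real
        for t in range(n)
    ]


def split_mag_phase(spectrum):
    """Decompose each complex bin into magnitude and phase (radians)."""
    mags = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return mags, phases


def apply_masks(mags, phases, mag_mask, phase_shift):
    """Illustrative masking (an assumption, not the paper's exact scheme):
    scale each magnitude by a gain and add a correction to each phase,
    then recombine into complex bins."""
    return [
        cmath.rect(m * g, p + d)
        for m, p, g, d in zip(mags, phases, mag_mask, phase_shift)
    ]


if __name__ == "__main__":
    frame = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
    mags, phases = split_mag_phase(dft(frame))
    # Identity masks: unit gain and zero phase correction.
    enhanced = apply_masks(mags, phases, [1.0] * len(mags), [0.0] * len(mags))
    print(idft(enhanced))
```

With identity masks the output reproduces the input frame up to floating-point error, which is a quick sanity check that the magnitude and phase paths recombine correctly before any learned masks are substituted.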

Paper Structure

This paper contains 19 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Speech enhancement networks based on self-supervised embeddings. (a) A general method. (b) A method that adds boosting to the magnitude spectrum. (c) The proposed magnitude-phase dual-path enhancement method.
  • Figure 2: Architecture of the proposed BSP-MPNet. "STFT" denotes the short-time Fourier transform of the speech.
  • Figure 3: Structure of Time-Frequency Attention (TFA) Module.
  • Figure 4: Visualization analysis (Matplotlib). (a) PESQ scores of BSP-MPNet and two baseline methods on 300 randomly selected samples from the VoiceBank+DEMAND test set. (b) Weights corresponding to each Transformer layer in different SSL models.
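
Figure 4(b) refers to a weight for each Transformer layer of an SSL model. One common way to obtain such weights is a softmax-normalized weighted sum over layer outputs, sketched below in plain Python. This is an assumption made for illustration: the text here does not confirm that BSP-MPNet uses exactly this formulation, and `softmax` and `weighted_layer_sum` are hypothetical helper names, not functions from the BSP-MPNet repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_scores):
    """Fuse per-layer features with softmax-normalized weights.

    layer_features: list of L layers, each a list of T frames,
                    each frame a list of D floats.
    layer_scores:   list of L scalars (learnable in a real model).
    Returns a T x D nested list.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("one score is required per layer")
    weights = softmax(layer_scores)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t, frame in enumerate(layer):
            for d, value in enumerate(frame):
                fused[t][d] += weight * value
    return fused


if __name__ == "__main__":
    # Two hypothetical layers, one frame each, two feature dimensions.
    layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
    print(weighted_layer_sum(layers, [0.0, 0.0]))
```

Equal scores reduce the fusion to a plain average of the layers. In a trained model the scores would be learned, and the resulting softmax weights are the kind of per-layer quantity that a plot like Figure 4(b) could display.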