Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

Chengzhong Wang, Andong Li, Dingding Yao, Junfeng Li

TL;DR

This work tackles the long-standing challenge of phase modeling in speech enhancement by enforcing Global Rotation Equivariance (GRE) to respect the circular topology of phase. It introduces a dual-stream magnitude-phase framework with a Magnitude-Phase Interactive Convolutional Module (MPICM) and a Hybrid-Attention Dual-FFN (HADF) bottleneck to enable constrained yet expressive cross-stream interaction. Empirical results across phase retrieval, denoising, dereverberation, and bandwidth extension show improved phase accuracy (e.g., reduced Phase Distance) and perceptual quality (PESQ) with fewer parameters than strong baselines. The findings highlight GRE as a powerful inductive bias for phase modeling, offering robust performance in universal SE tasks and better generalization to unseen acoustics.

Abstract

While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, because conventional networks typically operate within a flat Euclidean feature space, which is poorly suited to modeling the underlying circular topology of the phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing the Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations across phase retrieval, denoising, dereverberation, and bandwidth extension tasks validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. Its overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/RENet.
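
To make the GRE property concrete, the sketch below checks numerically that a toy phase-stream layer satisfies f(e^{jθ} z) = e^{jθ} f(z). It is only an illustration under stated assumptions, not the paper's MPICM or HADF: the layer is a bias-free complex linear map followed by a gate computed from moduli, and all names (`complex_linear`, `modulus_gate`, `phase_layer`, `equivariance_error`) are hypothetical. A complex bias is added as a counterexample to show how equivariance can break.

```python
import cmath
import math
import random


def complex_linear(weights, z):
    """Bias-free complex linear map: out[i] = sum_j weights[i][j] * z[j]."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def modulus_gate(z, mag):
    """Scale each complex feature by a real sigmoid gate built from moduli.

    The gate depends only on |z| and on a real magnitude-stream feature,
    both of which are unchanged by a global phase rotation of z.
    """
    gates = [1.0 / (1.0 + math.exp(-(abs(x) + m))) for x, m in zip(z, mag)]
    return [g * x for g, x in zip(gates, z)]


def phase_layer(weights, z, mag):
    """Toy phase-stream layer (assumed form): linear map, then modulus gate."""
    return modulus_gate(complex_linear(weights, z), mag)


def rotate(z, theta):
    """Apply the same global rotation e^{j*theta} to every element."""
    r = cmath.exp(1j * theta)
    return [r * x for x in z]


def equivariance_error(layer, z, theta):
    """Largest |layer(rotate(z)) - rotate(layer(z))| over output elements."""
    a = layer(rotate(z, theta))
    b = rotate(layer(z), theta)
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
dim = 4
weights = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
           for _ in range(dim)]
z = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
mag = [rng.gauss(0, 1) for _ in range(dim)]
theta = 1.234

# Equivariant case: the error is at floating-point rounding level.
gre_err = equivariance_error(lambda v: phase_layer(weights, v, mag), z, theta)

# Counterexample: a complex bias does not rotate with the input.
bias = [complex(0.5, -0.3)] * dim
biased_err = equivariance_error(
    lambda v: [x + b for x, b in zip(complex_linear(weights, v), bias)],
    z, theta)

print(f"modulus-gated layer error: {gre_err:.2e}")
print(f"biased layer error:        {biased_err:.2e}")
```

The first error stays at rounding level because every operation either commutes with multiplication by e^{jθ} (the linear map) or depends only on rotation-invariant quantities (the moduli). The bias term is a fixed complex offset that does not rotate with the input, so the second error is clearly nonzero.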

Paper Structure (31 sections, 16 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Overview of the proposed network architecture. (a) The dual-stream encoder-decoder topology. R-Conv and C-Conv denote real-valued and complex-valued convolutions, respectively. (b) Structure of the Magnitude-Phase Dilated DenseNet, illustrating the aligned channel concatenation. (c) Signal flow within the dual-path Hybrid-Attention Dual-FFN (HADF) bottleneck.
  • Figure 2: Detailed structure of the MPICM block, including the magnitude and phase dual-streams and their interaction via the gating mechanism.
  • Figure 3: Detailed architecture of the Hybrid-Attention Dual-FFN (HADF) module. (Top Left) The macroscopic residual block structure. (Bottom) The Hybrid Attention mechanism, illustrating the projection of complex queries/keys into a unified attention map. (Top Right) The distinct feed-forward networks for the magnitude (Mag-FFN) and phase (Pha-FFN) streams.
  • Figure 4: Performance comparison across varying SNRs. Models were trained on the DNS-2020 corpus and evaluated on re-mixed versions of the VoiceBank+DEMAND test set ranging from -10 dB to 15 dB.
  • Figure 5: Spectrogram visualization of enhanced speech under diverse distortion scenarios. The audio files are taken from the WSJ0+WHAMR! test set.
  • ...and 1 more figures