DM: Dual-path Magnitude Network for General Speech Restoration

Da-Hee Yang, Dail Kim, Joon-Hyuk Chang, Jeonghwan Choi, Han-gil Moon

TL;DR

The paper tackles general speech restoration under simultaneous distortions by introducing the Dual-path Magnitude (DM) network. The network uses two parallel magnitude decoders that share parameters: a masking-based decoder for suppressing distortion and a mapping-based decoder for generating missing content, connected by a skip from the masking path to the mapping path. A learnable skip scaling parameter $\alpha$ and a fusion weight $\omega$ govern how the two paths contribute to enhancement and generation. The input is modeled as $y = h(x*r) + n$, and the goal is to recover $x$. With only $2.05$ million parameters, the DM network is reported to outperform baselines such as VoiceFixer, HD-DEMUCS, and SGMSE+ across noise, reverberation, and bandwidth restoration, and the ablations highlight the benefits of parameter sharing and the skip mechanism. This work offers a compact approach to multi-distortion speech processing and a practical benchmark for real-world restoration tasks.
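
The distortion model $y = h(x*r) + n$ can be illustrated with a small simulation. This is a minimal sketch, not the authors' data pipeline: it assumes $*$ denotes convolution, $r$ is a room impulse response, $h$ is a band-limiting operation, and $n$ is additive noise, a reading consistent with the three distortions the summary names. The helper names (`synth_rir`, `band_limit`, `distort`), the synthetic decaying-noise impulse response, the brick-wall FFT low-pass, and the default cutoff and SNR are all illustrative assumptions.

```python
import numpy as np

def synth_rir(sr, rt60=0.3, seed=0):
    """Hypothetical room impulse response: exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    n = int(sr * rt60)
    t = np.arange(n) / sr
    # exp(-6.9 * t / rt60) decays by roughly 60 dB after rt60 seconds
    rir = rng.standard_normal(n) * np.exp(-6.9 * t / rt60)
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))

def band_limit(sig, sr, cutoff_hz):
    """Stand-in for h(.): zero every FFT bin above cutoff_hz."""
    spec = np.fft.rfft(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n=len(sig))

def distort(x, sr, cutoff_hz=4000.0, snr_db=10.0, seed=0):
    """Apply y = h(x * r) + n under the assumptions stated above."""
    rng = np.random.default_rng(seed)
    r = synth_rir(sr, seed=seed)
    reverberant = np.convolve(x, r)[: len(x)]         # x * r, trimmed to input length
    limited = band_limit(reverberant, sr, cutoff_hz)  # h(x * r)
    noise = rng.standard_normal(len(x))
    # Scale the noise so that the signal-to-noise ratio equals snr_db
    scale = np.sqrt(np.mean(limited**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return limited + scale * noise                    # h(x * r) + n
```

Note the order of operations: the noise is added after band-limiting, so in this sketch $n$ still occupies the full band even though the speech component does not.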

Abstract

In this paper, we introduce a novel general speech restoration model: the Dual-path Magnitude (DM) network, designed to address multiple distortions including noise, reverberation, and bandwidth degradation effectively. The DM network employs dual parallel magnitude decoders that share parameters: one uses a masking-based algorithm for distortion removal and the other employs a mapping-based approach for speech restoration. A novel aspect of the DM network is the integration of the magnitude spectrogram output from the masking decoder into the mapping decoder through a skip connection, enhancing the overall restoration capability. This integrated approach overcomes the inherent limitations observed in previous models, as detailed in a step-by-step analysis. The experimental results demonstrate that the DM network outperforms other baseline models in the comprehensive aspect of general speech restoration, achieving substantial restoration with fewer parameters.
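
The interaction between the two decoders can be sketched on magnitude spectrograms. The source does not give the exact equations, so everything below is an assumed form: a sigmoid mask on the masking path, an additive skip scaled by `alpha` into the mapping path, and a convex combination weighted by `omega`. The function `dual_path_fuse` and its arguments are hypothetical, and `alpha` and `omega` are plain floats here rather than learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_path_fuse(noisy_mag, mask_logits, map_out, alpha, omega):
    """Combine masking and mapping outputs (assumed form, not the paper's equations).

    noisy_mag   : input magnitude spectrogram, shape (freq, time), non-negative
    mask_logits : stand-in for the raw masking-decoder output
    map_out     : stand-in for the raw mapping-decoder output
    alpha       : skip scaling from the masking path into the mapping path
    omega       : fusion weight between the two paths, in [0, 1]
    """
    mask = sigmoid(mask_logits)        # bounded mask in (0, 1)
    masked_mag = mask * noisy_mag      # masking path: can only attenuate the input
    # Mapping path: direct magnitude estimate plus the scaled skip, kept non-negative
    mapped_mag = np.maximum(map_out + alpha * masked_mag, 0.0)
    fused = omega * masked_mag + (1.0 - omega) * mapped_mag
    return masked_mag, mapped_mag, fused
```

One property holds regardless of the exact formulation: a multiplicative mask cannot create energy in bins where the input magnitude is zero, so a band-limited input stays band-limited on the masking path. Only a mapping-style output can fill those bins, which is why combining the two paths suits inputs that need both suppression and bandwidth generation.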

Paper Structure (14 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Step-by-step modeling architecture of single-network (S1 and S2), unified network (U1), and proposed network (DM1 and DM2). DM1 is a model without skip connections, while DM2 incorporates skip connections with $\alpha$.
  • Figure 2: Comparison of spectrograms depicting the baseline, step-wise integration process, and proposed models. In the S1 model, although the noise and reverberation removal performance was excellent, bandwidth generation was not achieved. Generative models such as VoiceFixer and SGMSE+ produced more natural spectrograms, but the overall performance was inferior. The HD-DEMUCS model exhibited less effective noise and reverberation removal in the low-frequency band and generated excessive artifacts in the high-frequency band. Conversely, the proposed DM1 and DM2 models demonstrated satisfactory noise and reverberation removal performance, along with effective bandwidth generation.