Improving Design of Input Condition Invariant Speech Enhancement

Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

TL;DR

The paper tackles universal speech enhancement (SE) across arbitrary input conditions: varying audio duration, sampling frequency, and microphone configuration. It introduces USES2, a decoupled framework with separate single-channel and multi-channel processing, two time-frequency modeling variants (USES2-Swin and USES2-Comp), and a novel channel modeling module, TA_ttC, together with a two-stage training strategy that improves training efficiency. Results on five public corpora show that USES2 substantially improves real-condition performance (e.g., DNSMOS and WER) while keeping simulated-condition performance competitive and reducing model size and compute relative to the prior USES model. The recipe with full model details is released in the ESPnet repository on GitHub.

Abstract

Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed showing promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while real condition degradation is much mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and computational costs compared to the existing method. All proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance on real conditions. Recipe with full model details is released at https://github.com/espnet/espnet.

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The proposed input condition invariant SE model, USES2. Kernel size and feature maps of convolution layers are in gray.
  • Figure 2: Speech enhancement and recognition performance in diverse simulated conditions, averaged over the five corpora listed in the paper's corpora table. The models are trained only on 8 kHz data (via downsampling) and tested at the original sampling frequencies. "USES", "excl", and "U2-C" denote the SE model proposed in Toward-Zhang2023, corpus-exclusive SE, and the proposed USES2-Comp model, respectively. We follow the same training and evaluation configurations as in Toward-Zhang2023 to make the results comparable.