Toward Universal Speech Enhancement for Diverse Input Conditions

Wangyou Zhang, Kohei Saijo, Zhong-Qiu Wang, Shinji Watanabe, Yanmin Qian

TL;DR

The paper addresses universal speech enhancement (SE) by proposing USES, a single, unconstrained architecture built on TFPSNet components that handles diverse input conditions without task-specific retraining. It achieves sampling-frequency independence by pairing an STFT that uses fixed-duration window and hop sizes with transformer-based time-frequency (TF) modeling, microphone-channel independence through a transform-average-concatenate (TAC) module, and long-sequence handling through learnable memory tokens that are prepended to the feature sequences. Evaluations on a multi-corpus benchmark show competitive enhancement and separation performance across single- and multi-channel inputs, multiple sampling frequencies, reverberant and anechoic settings, and real recorded data; both the memory tokens and training on diverse sampling frequencies (SFs) improve robustness. The work offers a practical path toward a universal SE solution, with potential downstream gains for ASR and speech translation, and provides a benchmark framework to support further research on universal SE.
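
The memory-token idea mentioned above can be illustrated with a toy sketch. It is not the authors' implementation: the single-head attention layer, the random weights, and the names `self_attention` and `process_long_sequence` are assumptions made purely for illustration. The sketch only shows the mechanism of prefixing each segment with a small set of tokens and handing the updated tokens to the next segment.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Toy single-head self-attention over the rows of x, shape (length, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v                        # residual connection

def process_long_sequence(features, memory, segment_len, weights):
    """Process (T, D) features segment by segment, carrying (G, D) memory tokens."""
    num_mem = memory.shape[0]
    outputs = []
    for start in range(0, features.shape[0], segment_len):
        segment = features[start:start + segment_len]
        # Prefix the segment with the current memory tokens.
        out = self_attention(np.concatenate([memory, segment], axis=0), *weights)
        # The updated memory tokens are handed to the next segment.
        memory, enhanced = out[:num_mem], out[num_mem:]
        outputs.append(enhanced)
    return np.concatenate(outputs, axis=0), memory

rng = np.random.default_rng(0)
dim, num_mem = 8, 4                                  # hypothetical sizes
weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
init_memory = rng.standard_normal((num_mem, dim))    # learnable in a real model
features = rng.standard_normal((250, dim))           # 250 frames, segment length 100

output, final_memory = process_long_sequence(features, init_memory, 100, weights)
print(output.shape, final_memory.shape)  # (250, 8) (4, 8)
```

Because each segment sees only itself plus a fixed number of memory tokens, the attention cost per segment stays bounded regardless of the total signal length, while the tokens carry context forward from earlier segments.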

Abstract

The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation). Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model. In this paper, we make the first attempt to investigate this line of research. First, we devise a single SE model that is independent of microphone channels, signal lengths, and sampling frequencies. Second, we design a universal SE benchmark by combining existing public corpora with multiple conditions. Our experiments on a wide range of datasets show that the proposed single model can successfully handle diverse conditions with strong performance.

Paper Structure (13 sections, 3 figures, 5 tables)

This paper contains 13 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the proposed versatile SE model. The kernel size and feature maps of convolutional layers are annotated in gray.
  • Figure 2: STFT with fixed-duration window and hop sizes (e.g., 32 ms and 16 ms) will generate spectra with the same frequency and temporal resolution for different sampling frequencies.
  • Figure 3: Memory token-based long sequence modeling.
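
The property described in the Figure 2 caption can be checked numerically. The following is a minimal sketch, assuming a plain NumPy framed FFT with a Hann window; the function name `stft_fixed_duration` and the 1 kHz test tone are illustrative choices and not taken from the paper. The 32 ms window and 16 ms hop are the example values given in the caption.

```python
import numpy as np

def stft_fixed_duration(signal, sample_rate, win_ms=32.0, hop_ms=16.0):
    """Magnitude STFT whose window and hop are set in milliseconds, not samples."""
    win = int(round(sample_rate * win_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    window = np.hanning(win)
    num_frames = 1 + (len(signal) - win) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + win] * window for i in range(num_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1))   # shape: (frames, win // 2 + 1)
    bin_hz = sample_rate / win                   # width of one frequency bin
    return spec, bin_hz

for sr in (8000, 16000, 48000):
    t = np.arange(sr) / sr                       # one second of audio
    tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone
    spec, bin_hz = stft_fixed_duration(tone, sr)
    peak_bin = spec.mean(axis=0).argmax()
    print(sr, spec.shape, bin_hz, peak_bin)
```

At all three sampling frequencies the sketch yields 61 frames per second of audio and a bin width of 31.25 Hz, so the 1 kHz tone peaks at bin 32 in every case. Only the number of frequency bins changes (129, 257, and 769), which is why the model also needs a component that can process a variable number of bins.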