Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation

Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang

TL;DR

The paper addresses domain mismatch in cross-domain speech enhancement (SE) with NADA-GAN, a noise-aware data-simulation framework. A BEATs-based noise encoder extracts noise embeddings from target-domain data, and a generator conditioned on those embeddings synthesizes target-like noisy speech from clean source speech. The framework combines FiLM conditioning, patch-wise contrastive learning, and a noise reconstruction objective, and applies dynamic stochastic perturbation during inference to improve generalization to unseen noise. On the VoiceBank-DEMAND dataset with a DEMUCS SE backbone, NADA-GAN consistently outperforms a strong UNA-GAN baseline on PESQ and STOI and receives higher MOS in human judgments, while ablations confirm the importance of the noise embeddings and the perturbation. The approach enables data-efficient domain adaptation for SE and could be extended to ASR and other robust speech-processing tasks in diverse environments.

Abstract

Cross-domain speech enhancement (SE) is often faced with severe challenges due to the scarcity of noise and background information in an unseen target domain, leading to a mismatch between training and test conditions. This study puts forward a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings aptly guide the generator to synthesize utterances acoustically fitted to the target domain while authentically preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which can inject controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
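
The dynamic stochastic perturbation described in the abstract can be illustrated with a minimal sketch. The abstract says only that controlled perturbations are injected into the noise embeddings during inference, and Figure 2 refers to their standard deviations. The sketch therefore assumes additive zero-mean Gaussian noise, drawn afresh on each call. The function name `perturb_embedding`, the `sigma` parameter, and the list-of-floats embedding are hypothetical and are not taken from the paper.

```python
import random


def perturb_embedding(embedding, sigma, rng=None):
    """Return a perturbed copy of a noise embedding.

    Assumption (not confirmed by the source): the perturbation is additive,
    zero-mean Gaussian noise with standard deviation `sigma`, sampled
    independently for every dimension each time the function is called.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    if rng is None:
        rng = random.Random()
    # A fresh draw per call is one reading of "dynamic": each simulated
    # utterance is conditioned on a slightly different embedding.
    return [value + rng.gauss(0.0, sigma) for value in embedding]


if __name__ == "__main__":
    target_embedding = [0.5, -1.0, 2.0]  # hypothetical 3-dimensional embedding
    for _ in range(2):
        print(perturb_embedding(target_embedding, sigma=0.1))
```

With `sigma=0` the embedding is returned unchanged, so the standard deviation is the single knob that trades fidelity to the observed target noise against diversity of the simulated conditions.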

Paper Structure

This paper contains 22 sections, 6 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The architecture of our proposed method, NADA-GAN. The dotted arrows indicate that during the training phase, simulated speech $\mathbf{X}^G$ is used together with target noisy speech $\mathbf{X}^T$ to 1) train the discriminator, and 2) contribute to noise information reconstruction. The $\bigoplus$ operator denotes element-wise tensor addition.
  • Figure 2: PESQ and STOI results of dynamic stochastic perturbation w.r.t. various standard deviations.
  • Figure 3: The t-SNE visualization of noise embeddings extracted from unseen non-stationary noise categories, i.e., "Bus", "Cafe", "Living", "Office", and "Psquare".
  • Figure 4: SNR distributions w.r.t. the simulated speech (11,572 utt.), target noisy speech (40 utt.) used for GAN training, and test speech (824 utt.) of VoiceBank-DEMAND.
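
As background for the SNR distributions in Figure 4, the following sketch shows the standard waveform-level SNR computation for a noisy utterance with a time-aligned clean reference. The source does not say how the SNR values in the figure were obtained, so the use of a parallel clean reference is an assumption, and the function name `snr_db` is hypothetical.

```python
import math


def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB for time-aligned clean and noisy samples.

    The noise is taken to be the sample-wise residual `noisy - clean`.
    Assumption: a parallel clean reference is available for the utterance.
    """
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy must have the same length")
    signal_power = sum(s * s for s in clean)
    noise_power = sum((x - s) ** 2 for s, x in zip(clean, noisy))
    if signal_power == 0:
        raise ValueError("clean signal has zero energy")
    if noise_power == 0:
        return math.inf  # identical signals: no noise at all
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    noisy = [s + 0.1 for s in clean]  # constant offset standing in for noise
    print(round(snr_db(clean, noisy), 2))
```

Computing this value per utterance and plotting a histogram for each set would produce a distribution comparison of the kind the caption describes.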