Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang
TL;DR
The paper tackles domain mismatch in cross-domain speech enhancement (SE) by introducing NADA-GAN, a noise-aware data-simulation framework. A BEATs-based noise encoder extracts noise embeddings from target-domain data, and a generator conditioned on these embeddings synthesizes target-like noisy speech from clean source speech. The framework adds FiLM conditioning, patch-wise contrastive learning, and a noise reconstruction objective, and applies dynamic stochastic perturbation during inference to improve generalization to unseen noise. Evaluations on the VoiceBank-DEMAND dataset with a DEMUCS SE backbone show that NADA-GAN consistently outperforms a strong UNA-GAN baseline on PESQ and STOI and yields higher MOS in human judgments, while ablations confirm the crucial role of the noise embeddings and the perturbation. The approach enables data-efficient domain adaptation for SE and could be extended to ASR and other robust speech-processing tasks in diverse environments.
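As an illustration of the FiLM conditioning mentioned above, the following is a minimal NumPy sketch of generic feature-wise linear modulation, in which a conditioning vector produces a per-channel scale and shift. It is not the paper's implementation: the function name `film`, the linear projections `w_gamma` and `w_beta`, and the tensor shapes are all assumptions made for the example.

```python
import numpy as np


def film(features, embedding, w_gamma, w_beta):
    """Generic FiLM: scale and shift each feature channel using a conditioning vector.

    All names and shapes are hypothetical, chosen only for illustration.
    features:  (channels, time) intermediate generator features
    embedding: (dim,) conditioning vector, e.g. a noise embedding
    w_gamma, w_beta: (channels, dim) linear projections (assumed, not from the paper)
    """
    gamma = w_gamma @ embedding  # per-channel scale, shape (channels,)
    beta = w_beta @ embedding    # per-channel shift, shape (channels,)
    # Broadcast the per-channel parameters over the time axis.
    return gamma[:, None] * features + beta[:, None]


if __name__ == "__main__":
    feats = np.ones((2, 3))
    emb = np.array([1.0, 2.0])
    out = film(feats, emb, np.eye(2), np.zeros((2, 2)))
    print(out)  # each channel is scaled by its own gamma
```

In a trained model the projections would be learned jointly with the generator; here they are fixed arrays so the modulation itself is easy to inspect.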
Abstract
Cross-domain speech enhancement (SE) often suffers from the scarcity of noise and background information about an unseen target domain, which leads to a mismatch between training and test conditions. This study proposes a novel data simulation method to address this issue, leveraging noise-extractive techniques and generative adversarial networks (GANs) with only limited target noisy speech data. Notably, our method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings guide the generator to synthesize utterances that acoustically match the target domain while preserving the phonetic content of the input clean speech. Furthermore, we introduce the notion of dynamic stochastic perturbation, which injects controlled perturbations into the noise embeddings during inference, thereby enabling the model to generalize well to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset demonstrate that our domain-adaptive SE method outperforms an existing strong baseline based on data simulation.
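The idea of perturbing noise embeddings at inference time can be sketched as follows. The abstract does not specify the perturbation distribution or its strength, so this example assumes zero-mean Gaussian noise whose standard deviation is redrawn uniformly on every call; the function name `perturb_noise_embedding` and the parameters `min_scale` and `max_scale` are hypothetical and serve only to make the concept concrete.

```python
import numpy as np


def perturb_noise_embedding(embedding, min_scale=0.05, max_scale=0.2, rng=None):
    """Return a randomly perturbed copy of a noise embedding.

    Assumption (not from the paper): the perturbation is zero-mean Gaussian,
    and its standard deviation is sampled uniformly from
    [min_scale, max_scale] on each call.
    """
    rng = np.random.default_rng() if rng is None else rng
    embedding = np.asarray(embedding, dtype=np.float64)
    sigma = rng.uniform(min_scale, max_scale)  # perturbation strength for this call
    return embedding + sigma * rng.standard_normal(embedding.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target_embedding = np.zeros(8)  # stand-in for an encoder output
    # Each simulated utterance is conditioned on a slightly different embedding.
    variants = [perturb_noise_embedding(target_embedding, rng=rng) for _ in range(3)]
    for v in variants:
        print(np.round(v, 3))
```

Conditioning the generator on a different perturbed embedding for each clean utterance spreads the simulated noise conditions around the few observed target examples, which is the intuition behind the generalization claim.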
