Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio
Li Li, Shogo Seki
TL;DR
The paper investigates two domain adaptation-based speech enhancement (DASE) methods, RemixIT and Remixed2Remixed (Re2Re), and shows that imbalanced SNR distributions in pseudo-paired data can hinder student model performance on real-world recordings. It introduces an SNR control module (SNRCM) that samples SNRs from balanced distributions and employs curriculum learning (CL) to cover progressively broader SNR ranges, improving robustness to underrepresented data. Experiments on the CHiME-7 UDASE task, which uses the CHiME-5/LibriSpeech-based datasets, show SI-SDR gains for RemixIT and competitive gains for Re2Re, though PESQ and STOI benefits vary. Overall, the work advances DASE by addressing data imbalance in remixing, offering a practical way to improve model generalization to real-world recordings with widely varying SNRs.
Abstract
RemixIT and Remixed2Remixed are domain adaptation-based speech enhancement (DASE) methods that use a teacher model trained with full supervision to generate pseudo-paired data by remixing the outputs of the teacher model. The student model for enhancing real-world recorded signals is trained using the pseudo-paired data without ground truth. Since the noisy signals are recorded in natural environments, the dataset inevitably suffers from data imbalance in some acoustic properties, leading to subpar performance for the underrepresented data. The signal-to-noise ratio (SNR), inherently balanced in supervised learning, is a prime example. In this paper, we provide empirical evidence, using the dataset of the CHiME-7 UDASE task, that the SNR of pseudo data has a significant impact on model performance, highlighting the importance of balanced SNR in DASE. Furthermore, we propose adopting curriculum learning to encompass a broad range of SNRs to boost performance for underrepresented data.
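The idea of remixing at a controlled SNR, with the SNR range widened over training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform sampling, the linear widening schedule, and the start and final SNR ranges are all assumptions chosen for illustration. The speech and noise lists stand in for the teacher model's estimates.

```python
import math
import random


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10.0 * math.log10(power(speech) / power(noise))


def remix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech + scaled noise has the target SNR."""
    gain = math.sqrt(
        power(speech) / (power(noise) * 10.0 ** (target_snr_db / 10.0))
    )
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def curriculum_snr_range(epoch, num_epochs, start=(5.0, 10.0), final=(-10.0, 20.0)):
    """Linearly widen the SNR range from `start` to `final` (assumed schedule)."""
    t = min(1.0, epoch / max(1, num_epochs - 1))
    low = start[0] + t * (final[0] - start[0])
    high = start[1] + t * (final[1] - start[1])
    return low, high


def sample_balanced_snr(rng, snr_range):
    """Draw a target SNR uniformly so that no SNR band is underrepresented."""
    return rng.uniform(*snr_range)


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for the teacher's speech and noise estimates (hypothetical data).
    speech = [math.sin(0.1 * i) for i in range(1000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    for epoch in (0, 5, 9):
        snr_range = curriculum_snr_range(epoch, num_epochs=10)
        target = sample_balanced_snr(rng, snr_range)
        mixture, scaled = remix_at_snr(speech, noise, target)
        print(epoch, snr_range, round(target, 2), round(snr_db(speech, scaled), 2))
```

In this sketch the SNR of each remixed pair is set by the sampler rather than inherited from the recording conditions, which is the property the abstract argues matters for underrepresented data.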
