Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio

Li Li; Shogo Seki

TL;DR

The paper investigates two domain adaptation-based speech enhancement (DASE) methods, RemixIT and Remixed2Remixed (Re2Re), and shows that imbalanced SNR distributions in pseudo-paired data can hinder student model performance on real-world recordings. It introduces an SNR control module (SNRCM) that samples SNR values from balanced distributions, and it employs curriculum learning (CL) to cover progressively broader SNR ranges, improving robustness to underrepresented data. Experiments on the CHiME-7 UDASE task, which uses CHiME-5- and LibriSpeech-based datasets, show SI-SDR gains for RemixIT and competitive gains for Re2Re, though PESQ and STOI benefits vary. Overall, the work advances DASE by addressing data imbalance in remixing, offering a practical way to improve model generalization in real-world, multi-SNR conditions.

Abstract

RemixIT and Remixed2Remixed are domain adaptation-based speech enhancement (DASE) methods that use a teacher model trained with full supervision to generate pseudo-paired data by remixing the outputs of the teacher model. The student model for enhancing real-world recorded signals is trained using the pseudo-paired data without ground truth. Since the noisy signals are recorded in natural environments, the dataset inevitably suffers from data imbalance in some acoustic properties, leading to subpar performance on the underrepresented data. The signal-to-noise ratio (SNR), inherently balanced in supervised learning, is a prime example. In this paper, we provide empirical evidence, using the dataset of the CHiME-7 UDASE task, that the SNR of pseudo data has a significant impact on model performance, highlighting the importance of balanced SNR in DASE. Furthermore, we propose adopting curriculum learning to encompass a broad range of SNRs to boost performance for underrepresented data.
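The idea of controlling the SNR during remixing can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform sampling distribution, the dB values, the linear widening schedule, and all function names (`remix_at_snr`, `curriculum_snr_range`, `sample_snr`) are assumptions chosen for illustration, and plain Python lists stand in for audio arrays to keep the example dependency-free.

```python
import math
import random


def power(x):
    """Mean power of a signal given as a sequence of floats."""
    return sum(v * v for v in x) / len(x)


def remix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix.

    `speech` and `noise` stand in for the teacher's speech and noise estimates.
    Returns (mixture, scaled_noise).
    """
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / (power(noise) + eps))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def curriculum_snr_range(stage, num_stages, center_db=5.0, max_half_width_db=15.0):
    """Hypothetical schedule: the SNR range widens linearly around `center_db`.

    The center and width are illustrative values, not taken from the paper.
    """
    half = max_half_width_db * (stage + 1) / num_stages
    return center_db - half, center_db + half


def sample_snr(rng, low_db, high_db):
    """Draw a target SNR from a uniform (balanced) distribution; an assumption."""
    return rng.uniform(low_db, high_db)


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [rng.gauss(0.0, 1.0) for _ in range(16000)]
    noise = [rng.gauss(0.0, 0.3) for _ in range(16000)]
    for stage in range(3):
        low, high = curriculum_snr_range(stage, num_stages=3)
        snr_db = sample_snr(rng, low, high)
        mixture, scaled = remix_at_snr(speech, noise, snr_db)
        measured = 10.0 * math.log10(power(speech) / power(scaled))
        print(f"stage {stage}: range [{low:.1f}, {high:.1f}] dB, "
              f"target {snr_db:.2f} dB, measured {measured:.2f} dB")
```

Sampling the target SNR independently of the recording is what makes the remixed data balanced in this sketch: the scaled noise is forced to the drawn SNR regardless of the SNR at which the original signal was captured.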

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures, and 1 table.

Introduction
Remixing-based DASE
Common training strategy
RemixIT
Remixed2Remixed
Imbalanced dataset analysis
Brief introduction to the UDASE training dataset
SNR distributions of original and remixed datasets
SNR-aware remixing
Experimental evaluations
Evaluation dataset and metrics
Model architecture and training settings
Experimental results and discussions
Conclusions

Figures (4)

Figure 1: Estimated SNR distributions for CHiME-5 training dataset w/o VAD (left) and w/ VAD (right).
Figure 2: Measured SNR distributions for datasets generated by the remixing process in RemixIT (1st row) and Re2Re (2nd row), respectively. The left and right columns correspond to models trained on CHiME-5 w/o VAD and w/ VAD, respectively.
Figure 3: Flowcharts of (a) remixing without SNRCM, (b) remixing with SNRCM using predefined SNR distribution, and (c) remixing with SNRCM and CL that extends the range of SNR distribution in each training stage.
Figure 4: SI-SDR improvement [dB] achieved by RemixIT (top) and Re2Re (bottom). Models were trained with CHiME-5 w/o VAD (left) and w/ VAD (right), respectively. The red lines represent the median values, and the red triangle marks indicate the mean values. Teacher and student models were initialized using the checkpoint provided by the CHiME-7 UDASE task.
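For readers unfamiliar with the metric in Figure 4, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio (SI-SDR) and its improvement over the unprocessed mixture. The function names and the `eps` stabilizer are illustrative assumptions, and the paper's evaluation code may differ in details such as mean removal.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimate and a clean reference."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    num = sum(t * t for t in target)
    den = sum(v * v for v in residual)
    return 10.0 * math.log10((num + eps) / (den + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the enhanced signal minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    reference = [1.0, 0.0, -1.0, 0.0]
    noise = [0.0, 1.0, 0.0, 1.0]
    mixture = [r + n for r, n in zip(reference, noise)]
    estimate = [r + 0.1 * n for r, n in zip(reference, noise)]
    print(si_sdr_improvement(estimate, mixture, reference))
```

Because the reference is rescaled by the projection coefficient before the residual is computed, multiplying the estimate by any nonzero constant leaves the score unchanged, which is why the metric is called scale-invariant.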
