The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement
Simon Leglaive, Léonie Borne, Efthymios Tzinis, Mostafa Sadeghi, Matthieu Fraticelli, Scott Wisdom, Manuel Pariente, Daniel Pressnitzer, John R. Hershey
TL;DR
The paper presents CHiME-7 UDASE, a task on unsupervised domain adaptation for conversational speech enhancement. It aims to bridge the gap between supervised models trained on synthetic data and real-world target domains that lack clean references. Unlabeled in-domain CHiME-5 recordings are used for adaptation, LibriMix serves as the labeled out-of-domain (OOD) source, and the close-to-domain Reverberant LibriCHiME-5 dataset is used for evaluation, enabling objective and subjective assessments. The baseline RemixIT framework demonstrates that unsupervised adaptation can improve performance on Reverberant LibriCHiME-5 even when the model is trained primarily on OOD data, with variants such as RemixIT-VAD achieving the best performance on some metrics. The work establishes a practical benchmark that supports the development of robust, ecologically valid speech enhancement methods for real-world conversational settings, and it outlines an evaluation pipeline that includes SI-SDR, DNSMOS, and ITU-T P.835 listening tests.
Abstract
Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. This paper introduces the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to leverage real-world noisy speech recordings from the target domain for unsupervised domain adaptation of speech enhancement models. The target domain corresponds to the multi-speaker reverberant conversational speech recordings of the CHiME-5 dataset, for which the ground-truth clean speech reference is unavailable. Given a CHiME-5 recording, the task is to estimate the clean, potentially multi-speaker, reverberant speech, removing the additive background noise. We discuss the motivation for the CHiME-7 UDASE task and describe the data, the task, and the baseline system.
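As an illustration of the objective evaluation mentioned in the summary above, the following is a minimal sketch of the scale-invariant signal-to-distortion ratio (SI-SDR) for a single-channel estimate and its clean reference. It is not the challenge's official scoring code: the function name `si_sdr`, the mean removal step, and the `eps` stabilizer are assumptions made for this example.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Return the SI-SDR in dB between two 1-D signals of equal length.

    Illustrative sketch only; `eps` and the mean removal are assumptions.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Everything in the estimate that is not the scaled target is distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Toy usage with synthetic signals (random noise, not real speech).
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = si_sdr(reference + 0.1 * noise, reference)
heavy = si_sdr(reference + 1.0 * noise, reference)
rescaled = si_sdr(2.0 * (reference + 0.1 * noise), reference)
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant. Note that SI-SDR requires a clean reference, so it can only be computed on data where one exists, such as the Reverberant LibriCHiME-5 dataset, and not on the CHiME-5 recordings themselves.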
