Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker

TL;DR

The paper tackles the mismatch between fully supervised speech enhancement trained on synthetic data and real-world test conditions by introducing the CHiME-7 UDASE task, which leverages unlabeled CHiME-5 in-domain data for unsupervised adaptation. It benchmarks four systems across objective metrics (intrusive SI-SDR, PESQ, STOI and nonintrusive DNSMOS, TorchAudio-Squim) and a rigorous ITU-T P.835 listening test to assess both noise suppression and distortion. A key finding is the weak-to-mixed correlation between nonintrusive metrics and subjective quality, while intrusive metrics on the close-to-domain Reverberant LibriCHiME-5 offer a more reliable proxy for in-domain performance. The results underscore the difficulty of CHiME-7 UDASE, with only one system improving overall quality subjectively, and advocate using Reverberant LibriCHiME-5 for objective in-domain evaluation and exercising caution with nonintrusive metrics for domain-adaptation studies.

Abstract

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.
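As an illustration of what an intrusive metric is, the sketch below computes SI-SDR, one of the intrusive metrics used in the evaluation, from an enhanced signal and its clean reference. It follows the common textbook definition of SI-SDR and is not taken from the challenge tools. The function name `si_sdr`, the `eps` stabilizer, and the toy signals are assumptions made for this example.

```python
import math
import random


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (common definition)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    # Remove the mean of each signal (zero-mean convention).
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]
    # Project the estimate onto the reference to find the optimal scaling.
    ref_energy = sum(r * r for r in ref)
    alpha = sum(e * r for e, r in zip(est, ref)) / (ref_energy + eps)
    # Energy of the scaled target and of everything else (the residual).
    target_energy = alpha * alpha * ref_energy
    residual_energy = sum((e - alpha * r) ** 2 for e, r in zip(est, ref))
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals, purely hypothetical: a 440 Hz tone plus Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = [c + 0.1 * rng.gauss(0.0, 1.0) for c in clean]

score = si_sdr(noisy, clean)
rescaled_score = si_sdr([3.0 * x for x in noisy], clean)
print(f"SI-SDR: {score:.2f} dB, after rescaling the estimate: {rescaled_score:.2f} dB")
```

The need for `reference` is what makes the metric intrusive: it cannot be computed on real CHiME-5 recordings, for which no clean speech exists, and this is why a dataset with available references such as reverberant LibriCHiME-5 is required.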

Paper Structure (33 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Timeline of the listening experiment. The listening test includes several listening sessions. Each session is made of several trials, and each trial consists of three presentations of the same sound sample. For each presentation, the participant has to give a rating. In this figure, the rating scale order (BAK-SIG-OVRL or SIG-BAK-OVRL) is indicated for each session. The change of the rating scale order in the middle of the session is specific to the practice session.
  • Figure 2: Pairwise Pearson correlation coefficient of the objective performance metrics.
  • Figure 3: Mean results (dots) and 95% confidence intervals (bars) for the anchoring phase.
  • Figure 4: Boxplots, violin plots, and mean results for the subjective BAK, SIG, and OVRL mean opinion scores of the ITU-T P.835 listening test. Black dots and numbers above the box/violin plots correspond to the mean results. In each figure, the systems are ranked according to their mean results.
  • Figure 5: Pearson correlation between objective and subjective evaluation results, computed using only single-speaker samples (115 samples in total).
  • ...and 1 more figure
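Figures 2 and 5 report Pearson correlation coefficients. As a reminder of what that quantity measures, here is a minimal sketch of the coefficient for two paired lists of scores. The helper name `pearson` and the numbers are made up for illustration; they are not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two paired sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sample scores (NOT data from the paper):
# an objective metric and the matching subjective OVRL ratings.
objective_scores = [2.1, 2.8, 3.0, 3.6, 2.4]
subjective_ovrl = [2.5, 2.9, 2.7, 3.4, 2.6]

r = pearson(objective_scores, subjective_ovrl)
print(f"Pearson r = {r:.3f}")
```

A value near 1 would mean the objective metric ranks samples much as listeners do, while a value near 0 would mean it carries little linear information about the subjective ratings.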