Recent Trends in Distant Conversational Speech Recognition: A Review of CHiME-7 and 8 DASR Challenges

Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe

TL;DR

The paper surveys the CHiME-7 and CHiME-8 distant ASR challenges, focusing on multi-channel, real-world meeting transcription and diarization across diverse datasets. It analyzes 32 submissions to identify trends, notably the shift to end-to-end systems enabled by foundation models, continued reliance on guided source separation with TS-VAD diarization refinements, and the critical impact of accurate speaker counting. It also assesses downstream evaluation via meeting summarization with large language models, finding a weak link between transcription accuracy and summary quality, which suggests exploring end-to-end meeting summarization as a practical direction. The work discusses challenge design, baseline systems, and data accessibility, offering guidance for future robust, generalizable DASR benchmarks and practical research directions in diarization and multi-channel speech processing.

Abstract

The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
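The abstract reports results as time-constrained minimum permutation WER (tcpWER) without spelling out the metric. As a rough illustration only, the sketch below computes the simpler, unconstrained variant: each speaker's words are concatenated, and the error rate is taken over the speaker assignment that yields the fewest word errors. It deliberately omits the temporal constraint that tcpWER adds on which words may be matched, and it is not the challenge's scoring tool. The names `edit_distance` and `min_permutation_wer` and the toy transcripts are assumptions made for this example.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def min_permutation_wer(ref_by_spk, hyp_by_spk):
    """Unconstrained minimum-permutation WER (hypothetical helper).

    Both arguments map a speaker label to that speaker's words,
    with all utterances already concatenated in time order.
    The time constraint used by tcpWER is NOT modelled here.
    """
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    # Pad the shorter side with empty streams so that a wrong speaker
    # count is penalised as pure insertions or deletions.
    n = max(len(refs), len(hyps))
    refs = refs + [[] for _ in range(n - len(refs))]
    hyps = hyps + [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    # Brute force over speaker assignments; fine for a handful of speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


# Toy example: speaker labels differ between reference and hypothesis.
reference = {"A": "hello world".split(), "B": "good morning everyone".split()}
hypothesis = {"spk1": "good morning".split(), "spk2": "hello world".split()}
score = min_permutation_wer(reference, hypothesis)
```

The padding step shows why speaker counting matters for this family of metrics: every word attributed to a spurious extra speaker counts as an insertion, and every word of a missed speaker counts as a deletion.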

Paper Structure

This paper contains 45 sections, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Mean duration distribution of turn-taking events for each C7-8DASR scenario as obtained on the whole train, dev and eval splits.
  • Figure 2: Distribution of mean, maximum and minimum SDR as obtained across microphones for each utterance. We report statistics for each of the 4 scenarios separately.
  • Figure 3: ESPnet and NeMo baseline systems high-level overview. This same scheme was adopted by almost all C7-8DASR participants and top-performing systems in the "twin" CHiME-8 NOTSOFAR-1 challenge.
  • Figure 4: ESPnet baseline diarization pipeline scheme.
  • Figure 6: tcpWER (%) for each of the C7DASR core scenarios (CHiME-6, DiPCo, MX6), as well as their macro average (Macro), for both C7DASR and C8DASR systems. C8DASR systems are denoted by (C8).
  • ...and 12 more figures