The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization

Samuele Cornell, Taejin Park, Steve Huang, Christoph Boeddeker, Xuankai Chang, Matthew Maciejewski, Matthew Wiesner, Paola Garcia, Shinji Watanabe

TL;DR

This paper introduces CHiME-8 DASR, a challenge to advance generalizable, multi-channel distant automatic speech recognition and diarization across heterogeneous devices and varying speaker counts. It extends prior CHiME work by adding the NOTSOFAR-1 scenario, a manually corrected Mixer 6 development set, an LLM-enabled track, and a jury award, all supported by a data-preparation toolkit. Two baselines (one ESPnet-based, one NeMo-based) illustrate array-agnostic pipelines that combine diarization, guided source separation, and ASR, evaluated with time-constrained cpWER (tcpWER) across four core scenarios. The results indicate that accurate speaker counting and robust diarization are the main bottlenecks, and they point to the potential of LLM-assisted approaches for improving transcription and attribution in long-form, multi-device conversations. The work provides practical resources and a standardized evaluation protocol to accelerate progress toward robust, real-world DASR systems.
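To make the metric concrete, the following is a minimal sketch of plain cpWER (concatenated minimum-permutation WER), the quantity that tcpWER builds on. It is not the challenge's scoring code: the function names `edit_distance` and `cpwer` are hypothetical, the temporal constraint that distinguishes tcpWER is omitted, no text normalization is applied, and the brute-force search over speaker permutations is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cpwer(ref_by_spk, hyp_by_spk):
    """Plain cpWER: concatenate each speaker's utterances, then take the
    speaker mapping that minimizes the total word errors.

    Both arguments map a speaker label to a list of utterance strings.
    Illustrative assumption: a missing or extra speaker is scored against
    an empty transcript, so a wrong speaker count is penalized.
    """
    refs = [" ".join(utts).split() for utts in ref_by_spk.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_spk.values()]
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    ref = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
    hyp = {"spk1": ["fine thanks"], "spk2": ["hello word", "how are you"]}
    print(f"cpWER = {cpwer(ref, hyp):.3f}")
```

Because hypothesis speaker labels are arbitrary, the score depends only on the best assignment of hypothesis speakers to reference speakers. Padding with empty transcripts shows why speaker counting matters for this family of metrics: every word attributed to a spurious speaker counts as an insertion, and every word of a missed speaker counts as a deletion.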

Abstract

This paper presents the CHiME-8 DASR challenge, which carries on from the previous edition, CHiME-7 DASR (C7DASR), and the earlier CHiME-6 challenge. It focuses on joint multi-channel distant speech recognition (DASR) and diarization with one or more, possibly heterogeneous, devices. The main goal is to spur research towards meeting transcription approaches that can generalize across an arbitrary number of speakers, diverse settings (formal vs. informal conversations), meeting durations, a wide variety of acoustic scenarios, and different recording configurations. Novelties with respect to C7DASR include: i) the addition of NOTSOFAR-1, an additional office/corporate meeting scenario, ii) a manually corrected Mixer 6 development set, iii) a new track in which we allow the use of large language models (LLMs), and iv) a jury award mechanism to encourage participants to also explore more practical and innovative solutions. To lower the entry barrier for participants, we provide a standalone toolkit for downloading and preparing these datasets, as well as for performing text normalization and scoring submissions. Furthermore, this year we also provide two baseline systems: one directly inherited from C7DASR and based on ESPnet, and another developed on NeMo and based on the NeMo team's submission to last year's C7DASR. Baseline system results suggest that the addition of the NOTSOFAR-1 scenario significantly increases the task's difficulty due to its high number of speakers and very short duration.

Paper Structure

This paper contains 16 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Basic overview of the ESPnet and NeMo baseline systems.