CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant
TL;DR
The CHiME-6 paper addresses robust multispeaker ASR and diarization in realistic home environments through two challenge tracks: Track 1 covers ASR with ground-truth diarization, while Track 2 covers full diarization plus ASR on unsegmented multispeaker recordings. It presents reproducible Kaldi-based baselines that cover accurate array synchronization, speech enhancement front-ends (GSS/WPE and BeamformIt), speech activity detection (SAD), diarization, and RTTM refinement, all integrated into a unified CHiME-6 recipe. Baseline results show competitive ASR performance for Track 1 but reveal how strongly diarization errors degrade recognition in Track 2, underscoring the importance of jointly optimizing separation, diarization, and recognition. The open-source recipes are intended to accelerate progress toward practical, real-world multispeaker ASR systems.
Abstract
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. The speech material is the same as in the previous CHiME-5 recordings, except that the arrays are now accurately synchronized. The material was elicited using a dinner party scenario, with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open-source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
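
To make the difference between the two tracks concrete: in Track 2 the recognizer receives no segmentation, so a diarization module must first decide who spoke when. Such output is commonly exchanged as RTTM, a plain-text format with one `SPEAKER` line per speech segment. The sketch below is illustrative only and is not taken from the CHiME-6 recipe; the recording name `session_a`, the speaker labels, the timings, and the helper names `parse_rttm` and `speech_time_per_speaker` are all made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    recording: str   # recording / file identifier (RTTM field 2)
    start: float     # segment onset in seconds (RTTM field 4)
    duration: float  # segment duration in seconds (RTTM field 5)
    speaker: str     # speaker label (RTTM field 8)


def parse_rttm(text: str) -> List[Segment]:
    """Parse RTTM text, keeping only well-formed SPEAKER lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, other record types, and truncated lines.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        segments.append(
            Segment(fields[1], float(fields[3]), float(fields[4]), fields[7])
        )
    return segments


def speech_time_per_speaker(segments: List[Segment]) -> Dict[str, float]:
    """Sum segment durations per speaker label (overlaps are not merged)."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)


# Hypothetical diarization output; the first two segments overlap in time,
# as is typical of conversational multispeaker speech.
EXAMPLE_RTTM = """\
SPEAKER session_a 1 12.50 3.25 <NA> <NA> spk1 <NA> <NA>
SPEAKER session_a 1 14.00 2.00 <NA> <NA> spk2 <NA> <NA>
SPEAKER session_a 1 20.00 1.75 <NA> <NA> spk1 <NA> <NA>
"""

if __name__ == "__main__":
    segs = parse_rttm(EXAMPLE_RTTM)
    for speaker, seconds in sorted(speech_time_per_speaker(segs).items()):
        print(f"{speaker}: {seconds:.2f} s")
```

In a Track 1 style setup, segments like these would come from the ground-truth annotation; in a Track 2 style setup they would have to be estimated from the audio, so any error in them propagates to the recognizer.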
