
NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription

Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

TL;DR

NOTSOFAR-1 targets distant speaker diarization and automatic speech recognition (DASR) in far-field meetings. It introduces two core resources: a real-world benchmark of 315 meetings recorded across 30 conference rooms, and a 1000-hour simulated training set built with 15,000 real acoustic transfer functions (ATFs). It also provides an open-source baseline pipeline covering continuous speech separation (CSS), ASR, and diarization. The challenge defines two tracks (single-channel and known-geometry multi-channel) and two evaluation metrics (time-constrained minimum-permutation WER, or tcpWER, and a speaker-agnostic WER) to encourage geometry-aware, data-driven front-ends. The paper addresses the shortage of high-quality training and benchmarking data in DASR by providing standardized datasets, a scalable simulation framework, and a publicly available baseline that make systems directly comparable. Together, these resources aim to improve generalization to real-world far-field conditions and to accelerate research in distant meeting transcription and diarization.

Abstract

We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside its datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets. The first is a benchmarking dataset of 315 meetings, averaging 6 minutes each, that captures a broad spectrum of real-world acoustic conditions and conversational dynamics. It was recorded across 30 conference rooms, with 4-8 attendees per meeting and a total of 35 unique speakers. The second is a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization and incorporating 15,000 real acoustic transfer functions. The tasks focus on single-device DASR, where multi-channel devices always share the same known geometry. This matches common setups in actual conference rooms and avoids the technical complexities associated with multi-device tasks. It also allows for the development of geometry-specific solutions. The NOTSOFAR-1 Challenge aims to advance research in distant conversational speech recognition by providing key resources to unlock the potential of data-driven methods, which we believe are currently constrained by the absence of comprehensive, high-quality training and benchmarking datasets.

Paper Structure (9 sections, 1 equation, 2 tables)