Summary of the NOTSOFAR-1 Challenge: Highlights and Learnings

Igor Abramovski, Alon Vinnikov, Shalev Shaer, Naoyuki Kanda, Xiaofei Wang, Amir Ivry, Eyal Krupka

TL;DR

The paper analyzes the NOTSOFAR-1 DASR challenge, which introduced a realistic dataset of 315 recorded meetings across 30 rooms and a 1000-hour simulated training set built with 15,000 real ATFs to narrow the gap between simulated and real-world conditions. It contrasts two pipeline families, Dia-Sep-ASR and CSS-ASR-Dia, and shows that Dia-Sep-ASR generally outperforms CSS-based approaches, especially in multi-channel settings where spatial cues are available. Top-performing systems combine advanced diarization (TS-VAD, NSD-MS2S) with ASR adaptations (Enhanced Whisper, ensemble methods), with selective gains from GSS-based separation, TF-domain initialization, or Target Speaker Extraction. The findings highlight the critical role of adapting ASR to far-field conditions, the enduring value of spatial information for diarization, and the potential of real-data fine-tuning for neural CSS, charting a path for future DASR research and for practical, business-ready meeting transcription systems.

Abstract

The first Natural Office Talkers in Settings of Far-field Audio Recordings (NOTSOFAR-1) Challenge is a pivotal initiative that sets new benchmarks by offering datasets more representative of the needs of real-world business applications than those previously available. The challenge provides a unique combination of 280 recorded meetings across 30 diverse environments, capturing real-world acoustic conditions and conversational dynamics, and a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions. In this paper, we provide an overview of the systems submitted to the challenge and analyze the top-performing approaches, hypothesizing the factors behind their success. Additionally, we highlight promising directions left unexplored by participants. By presenting key findings and actionable insights, this work aims to drive further innovation and progress in DASR research and applications.

Paper Structure

This paper contains 36 sections, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Best tcpWER per hashtag in SC and MC tracks
  • Figure 2: Relative increase of best tcpWER in SC track compared to MC per hashtag