Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach

TL;DR

This work addresses meeting transcription from single-channel recordings with a modular CSS-AD pipeline that combines Continuous Speech Separation (CSS), speaker-agnostic speech recognition, and ASR-informed diarization. The pipeline uses TF-GridNet for separation, derives sentence-level and word-level segmentation cues from the ASR output, and clusters d-vectors to assign segments to speakers. On Libri-CSS it achieves state-of-the-art ORC WER and, when segmentation is guided by the transcription, a state-of-the-art cpWER. The results show that a modular pipeline is a viable approach to single-channel meeting transcription with strong diarization and recognition performance.

Abstract

We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
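
To make the cpWER metric named above concrete, the following is a minimal sketch of how it can be computed: all utterances of each speaker are concatenated, and the speaker permutation with the fewest word errors is scored. The function names (`edit_distance`, `cp_wer`) and the dictionary input format are assumptions made for illustration and are not taken from the paper; the brute-force permutation search is likewise a simplification that is only practical for a small number of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_wer(reference, hypothesis):
    """Concatenated minimum-permutation WER (illustrative sketch).

    Both arguments map a speaker label to that speaker's utterances
    (strings) in temporal order. Labels need not match between the two.
    """
    # Concatenate each speaker's utterances into one word stream.
    ref_streams = [" ".join(utts).split() for utts in reference.values()]
    hyp_streams = [" ".join(utts).split() for utts in hypothesis.values()]

    # Pad with empty streams so both sides have the same speaker count.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[] for _ in range(n - len(ref_streams))]
    hyp_streams += [[] for _ in range(n - len(hyp_streams))]

    total_ref_words = sum(len(r) for r in ref_streams)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Pairwise distances, then brute-force search over speaker assignments.
    dist = [[edit_distance(r, h) for h in hyp_streams] for r in ref_streams]
    best = min(
        sum(dist[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    return best / total_ref_words


reference = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
hypothesis = {"spk1": ["fine thanks"], "spk2": ["hello word", "how are you"]}
error_rate = cp_wer(reference, hypothesis)  # one substitution over seven words
```

Because the search visits every permutation, its cost grows factorially with the number of speakers; solving the same problem as a linear assignment over the distance matrix gives the identical minimum at much lower cost.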

Paper Structure

This paper contains 20 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Proposed processing pipeline for meeting transcription.
  • Figure 2: Comparison of the different segmentation schemes. Here, colors represent speakers. SB and SC stand for sentence boundary and speaker change.