SCDiar: a streaming diarization system based on speaker change detection and speech recognition

Naijun Zheng, Xucheng Wan, Kai Liu, Zhou Huan

TL;DR

SCDiar addresses real-time speaker diarization for hours-long meetings in online speaker-attributed ASR (SA-ASR). It introduces a streaming pipeline in which a speaker change detection (SCD) module splits speech into segments at the token level, and a speaker diarization (SD) module then selects segments before clustering. The approach hinges on three innovations: a length-aware segment-token similarity matrix $A$ learned by the SD network; representative segment selection through the optimization $\min_x \|A x - 1\|^2$ subject to $0 \le x \le 1$; and a streaming speaker label mapping with a cache to update embeddings, all trained with Transcript-Preserving Speaker Transfer (TPST). The paper also uses a multi-target loss and data augmentation to train the ASR-SCD-SD stack, and reports strong results on AISHELL-4 and a challenging in-house dataset. On real-world meetings with more than ten participants, SCDiar outperforms previous online methods by up to 53.6% in accuracy while keeping real-time processing capability, which narrows the gap to offline systems. The segment-token similarity matrix $A$ and the representative-segment optimization give the streaming SA-ASR pipeline a principled formulation for multi-speaker settings.
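The representative-segment selection named above is a box-constrained least-squares problem, and a small numerical sketch may make it more concrete. The code below is only an illustration under stated assumptions: it assumes the rows of $A$ index tokens and the columns index candidate segments (an orientation this summary does not state), it uses projected gradient descent as the solver (a choice made here, not one attributed to the paper), and the names `select_segments` and `residual_sq` as well as the toy matrix are hypothetical.

```python
def select_segments(A, steps=500):
    """Approximately solve min_x ||A x - 1||^2 subject to 0 <= x <= 1.

    A is a list of rows. Rows are ASSUMED to index tokens and columns
    to index candidate segments. The solver is projected gradient
    descent, chosen here purely for illustration.
    """
    n_tokens, n_segments = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # A^T A, so a step of 1 / frob_sq on the half-gradient A^T (A x - 1)
    # is small enough for the iteration to converge.
    frob_sq = sum(a * a for row in A for a in row)
    if frob_sq == 0.0:
        return [0.0] * n_segments
    x = [0.0] * n_segments
    for _ in range(steps):
        # Residual r = A x - 1: deviation of each token's total weight from 1.
        r = [sum(A[t][s] * x[s] for s in range(n_segments)) - 1.0
             for t in range(n_tokens)]
        # Half-gradient g = A^T r.
        g = [sum(A[t][s] * r[t] for t in range(n_tokens))
             for s in range(n_segments)]
        # Gradient step followed by projection onto the box [0, 1].
        x = [min(1.0, max(0.0, x[s] - g[s] / frob_sq))
             for s in range(n_segments)]
    return x


def residual_sq(A, x):
    """Return ||A x - 1||^2 for a candidate weight vector x."""
    return sum((sum(a * xi for a, xi in zip(row, x)) - 1.0) ** 2 for row in A)


# Toy similarity matrix (hypothetical values): 4 tokens, 3 segments.
# Tokens 0-1 resemble segments 0 and 2; tokens 2-3 resemble segment 1.
A = [
    [1.0, 0.0, 0.6],
    [1.0, 0.0, 0.6],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
]
x = select_segments(A)
print("segment weights:", [round(v, 3) for v in x])
print("squared residual:", residual_sq(A, x))
```

Segments with larger entries in `x` are the ones that, taken together, account for every token about once. How such weights are thresholded or passed to clustering is not specified in this summary, so that step is left out of the sketch.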

Abstract

In hours-long meeting scenarios, accurate speaker diarization on a real-time speech stream is difficult, and systems commonly make speaker identification and speaker count errors. To address this challenge, we propose SCDiar, a system that operates on speech segments, split at the token level by a speaker change detection (SCD) module. Building on these segments, we introduce several enhancements to efficiently select the best available segment for each speaker. These improvements lead to significant gains across various benchmarks. Notably, on real-world meeting data involving more than ten participants, SCDiar outperforms previous systems by up to 53.6% in accuracy, substantially narrowing the performance gap between online and offline systems.

Paper Structure

This paper contains 13 sections, 12 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: (a) The overview of inference process in the SCDiar system. (b) The structure of the SCD module. (c) The structure of the SD module.
  • Figure 2: An example of TPST [Wang2024DiarizationLMSD].
  • Figure 3: Segment-token similarity matrix from (a) cosine distance, (b) estimated ${\bf{A}}^T$ and (c) target ${\bar{\bf{A}}}^T$. Reference speaker IDs are listed at the bottom with 7 segments.
  • Figure 4: ASR and diarization results on AISHELL-4 with different maximum active segment lengths of the VAD.