SCDiar: a streaming diarization system based on speaker change detection and speech recognition

Naijun Zheng; Xucheng Wan; Kai Liu; Zhou Huan

TL;DR

SCDiar targets real-time speaker diarization for hours-long meetings in online speaker-attributed ASR (SA-ASR). It is a streaming pipeline in which a speaker change detection (SCD) module splits speech into segments at the token level, and a segment selection step then chooses which segments are passed to clustering in the SD stage. The approach rests on three components: a length-aware segment-token similarity matrix $A$ learned by the SD network; representative segment selection by solving $\min_x \|Ax - \mathbf{1}\|^2$ subject to $0 \le x \le 1$; and a streaming speaker label mapping that uses a cache to update embeddings, all trained with Transcript-Preserving Speaker Transfer (TPSP). The paper also uses a multi-target loss and data augmentation to train the ASR-SCD-SD stack, and reports strong results on AISHELL-4 and a challenging in-house dataset, narrowing the gap to offline systems. On real-world meetings with many participants, SCDiar improves on previous online methods by up to about 53.6% in accuracy while retaining real-time processing capability.
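The box-constrained least-squares problem mentioned above can be illustrated with a small sketch. This is not the paper's solver, which the summary does not describe. It uses plain projected gradient descent as a stand-in, and it assumes that the rows of $A$ index tokens, the columns index candidate segments, and $\mathbf{1}$ is the all-ones vector. The function name `select_representatives` and the toy matrix are hypothetical.

```python
def select_representatives(A, steps=500):
    """Approximately solve min_x ||A x - 1||^2 subject to 0 <= x <= 1.

    Assumption (not from the paper): A is a list of rows, one row per token
    and one column per candidate segment. Projected gradient descent is an
    illustrative choice, not the authors' method.
    """
    m, n = len(A), len(A[0])
    # 2 * ||A||_F^2 upper-bounds the Lipschitz constant of the gradient,
    # so 1 / (2 * ||A||_F^2) is a safe fixed step size.
    frob2 = sum(a * a for row in A for a in row)
    if frob2 == 0.0:
        return [0.0] * n
    lr = 1.0 / (2.0 * frob2)
    x = [0.0] * n
    for _ in range(steps):
        # Residual r = A x - 1
        r = [sum(A[i][j] * x[j] for j in range(n)) - 1.0 for i in range(m)]
        # Gradient g = 2 A^T r
        g = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by projection onto the box [0, 1]^n
        x = [min(1.0, max(0.0, x[j] - lr * g[j])) for j in range(n)]
    return x


# Toy example: 4 tokens, 3 candidate segments (made-up similarities).
# Segments 0 and 1 both match tokens 0-1, segment 0 more strongly;
# segment 2 is the only one that matches tokens 2-3.
A = [
    [0.9, 0.4, 0.0],
    [0.9, 0.4, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
x = select_representatives(A)
print([round(v, 3) for v in x])
```

In this toy case the stronger of the two redundant segments receives the larger weight, and the only segment covering the last two tokens is weighted close to 1. A bounded least-squares routine such as SciPy's `scipy.optimize.lsq_linear` could replace the hand-written loop.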

Abstract

In hours-long meeting scenarios, systems that process a real-time speech stream often struggle to achieve accurate speaker diarization, commonly producing speaker identification and speaker count errors. To address this challenge, we propose SCDiar, a system that operates on speech segments split at the token level by a speaker change detection (SCD) module. Building on these segments, we introduce several enhancements to efficiently select the best available segment for each speaker. These improvements lead to significant gains across various benchmarks. Notably, on real-world meeting data involving more than ten participants, SCDiar outperforms previous systems by up to 53.6% in accuracy, substantially narrowing the performance gap between online and offline systems.
