A Review of Common Online Speaker Diarization Methods

Roman Aperdannier, Sigurd Schacht, Alexander Piazza

TL;DR

The paper surveys online speaker diarization, the task of labeling "who spoke when" with low latency, and reviews historical and contemporary approaches. It contrasts modular pipelines (GMM, i-vector, UIS-RNN, Turn-to-Diarize) with end-to-end frameworks (FS-EEND, Minivox), and covers evaluation metrics (diarization error rate, DER; Jaccard error rate, JER), datasets (CALLHOME, NIST RT, DIHARD, VoxConverse), and core mechanisms. Key insights include the evolution from hand-crafted components to trainable end-to-end models, the trade-off between latency and accuracy, and the persistent challenges of data scarcity and a flexible number of speakers in online settings. The work highlights practical implications for real-time transcription and live analytics, and emphasizes that further research is needed to achieve robust, low-latency diarization in diverse multi-speaker scenarios.

Abstract

Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview of the field. First, the history of online speaker diarization is briefly presented. Next, a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. The paper concludes by presenting challenges that future research in online speaker diarization still needs to solve.

Paper Structure (31 sections, 4 equations, 3 figures, 1 table)