Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency

Roman Aperdannier, Sigurd Schacht, Alexander Piazza

TL;DR

The paper tackles the absence of standardized latency benchmarks for online speaker diarization by systematically comparing multiple online systems on the same hardware and dataset. It evaluates the DIART framework with various embedding and segmentation models, along with UIS-RNN-SML and FS-EEND, focusing on the latency from audio input to speaker label output. Key findings show that DIART with pyannote/embedding and pyannote/segmentation achieves the lowest mean latency (approximately 0.057 s), FS-EEND performs similarly well (approximately 0.058 s), while UIS-RNN-SML exhibits latency that grows with streaming length, making it unsuitable for long inputs. The work provides a practical baseline for latency-aware diarization and highlights trade-offs between modular online pipelines and end-to-end streaming approaches, with future work suggesting joint accuracy-latency optimizations and scalability to more speakers.

Abstract

In this paper, different online speaker diarization systems are evaluated with regard to their latency on the same hardware and with the same test data. Latency is defined as the time span from audio input to the output of the corresponding speaker label. The evaluation compares various model combinations within the DIART framework, a diarization system based on the online clustering algorithm UIS-RNN-SML, and the end-to-end online diarization system FS-EEND. The lowest latency is achieved by the DIART pipeline with the embedding model pyannote/embedding and the segmentation model pyannote/segmentation. The FS-EEND system shows a similarly low latency. To date, no published research compares several online diarization systems in terms of their latency, which makes this work all the more relevant.
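
The latency definition above (time from audio input to the corresponding speaker label) can be illustrated with a minimal timing harness. This is only a sketch under stated assumptions, not the measurement procedure used in the paper: `measure_chunk_latencies` and `dummy_diarizer` are hypothetical names, the diarizer is a placeholder that stands in for a real system such as a DIART pipeline or FS-EEND, and only per-chunk processing time is captured (any buffering delay before a chunk is complete is ignored).

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def measure_chunk_latencies(
    chunks: Sequence[Sequence[float]],
    process_chunk: Callable[[Sequence[float]], object],
) -> List[float]:
    """Return the wall-clock processing time (in seconds) for each chunk.

    Assumption: latency is approximated as the time between handing a
    chunk to the system and receiving its speaker labels.
    """
    latencies = []
    for chunk in chunks:
        start = time.perf_counter()   # chunk enters the system
        process_chunk(chunk)          # speaker labels are produced
        latencies.append(time.perf_counter() - start)
    return latencies


def dummy_diarizer(chunk: Sequence[float]) -> List[int]:
    """Hypothetical placeholder: assigns speaker 0 to every sample."""
    return [0] * len(chunk)


if __name__ == "__main__":
    # 20 chunks of 0.5 s each at an assumed 16 kHz sampling rate.
    stream = [[0.0] * 8000 for _ in range(20)]
    latencies = measure_chunk_latencies(stream, dummy_diarizer)
    print(f"mean latency: {mean(latencies):.6f} s over {len(latencies)} chunks")
```

Recording one value per chunk, rather than a single total, also makes it possible to see whether latency stays flat or grows as the stream gets longer, which is the behavior reported for UIS-RNN-SML.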

Paper Structure

This paper contains 26 sections, 1 equation, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Latency UIS-RNN-SML per chunk
  • Figure 2: Total execution time UIS-RNN-SML