Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan

TL;DR

This work tackles long-form speaker diarization by eliminating speaker embeddings and applying end-to-end neural diarization (EEND) at both local and global scales. It introduces a three-step pipeline (local EEND, global EEND, and clustering) in which the global step runs EEND on pairwise speaker chunks to produce an embedding-free affinity for spectral clustering. The approach achieves 13% and 10% relative DER reductions over 1-pass EEND on Callhome American English (CHAE) and RT03-CTS, respectively, and marginal improvements over the embedding-based EEND-vector-clustering baseline without requiring a separate speaker embedding framework. Efficiency analyses show that batching and frame-subset strategies can reduce processing time by up to around 70% without harming diarization accuracy, enabling scalable deployment for long recordings with many speakers.

Abstract

End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based speaker diarization (SD) approaches but fall short in generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but it requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework that applies EEND both locally and globally to long-form audio without separate speaker embeddings. This approach achieves significant relative DER reductions of 13% and 10% over the conventional 1-pass EEND on the Callhome American English and RT03-CTS datasets, respectively, and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.

Paper Structure (16 sections, 7 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Local-global EEND framework. This assumes 3 local windows with 2-speaker local EEND, i.e., $W=3$, $S_{local}=2$, resulting in $C=12$ pairwise-speaker chunks for global EEND.
  • Figure 2: Real-time factor (RTF) vs. DER for different efficiency strategies, including batching the inferences and minimizing the number of frames required for each speaker. N indicates a subset of N random frames.
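
The chunk count in the Figure 1 caption can be reproduced with a small sketch. It assumes that each pairwise-speaker chunk joins one local speaker from one window with one local speaker from a different window; that reading is consistent with $C=12$ for $W=3$, $S_{local}=2$, but it is an inference from the caption rather than a statement of the authors' implementation. The function name `pairwise_speaker_chunks` and the `(window, speaker)` tuple layout are hypothetical.

```python
from itertools import combinations, product


def pairwise_speaker_chunks(num_windows, speakers_per_window):
    """Enumerate cross-window speaker pairs as ((w_a, s_a), (w_b, s_b)).

    Assumption: pairs are formed only between speakers from two
    different local windows, never within the same window.
    """
    pairs = []
    # Every unordered pair of distinct windows ...
    for w_a, w_b in combinations(range(num_windows), 2):
        # ... combined with every choice of one local speaker from each.
        for s_a, s_b in product(range(speakers_per_window), repeat=2):
            pairs.append(((w_a, s_a), (w_b, s_b)))
    return pairs


pairs = pairwise_speaker_chunks(num_windows=3, speakers_per_window=2)
print(len(pairs))  # 12, matching C in the Figure 1 caption
```

Under this assumption the count is $\binom{W}{2} \cdot S_{local}^2$, which grows quadratically in the number of windows and so in recording length. That growth is one plausible reason the efficiency strategies in Figure 2 matter for long recordings.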