End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization

Prachi Singh; Sriram Ganapathy

TL;DR

This work addresses speaker diarization by introducing E-SHARC, an end-to-end supervised hierarchical graph clustering framework that jointly learns the embedding extractor and a GNN-based clustering module. The embedding extractor is initialized from a pre-trained x-vector model and operates on mel-filterbank features. E-SHARC extends SHARC with end-to-end optimization, and the E-SHARC-Overlap variant handles overlapped speech through a two-pass process guided by an external overlap detector. Results on AMI, VoxConverse, and DISPLACE show DER improvements over traditional baselines and competitive performance against recent state-of-the-art methods, with additional gains when using VBx resegmentation. The paper also demonstrates representation improvements and provides ablations on hyperparameters, architectures, and overlap handling.

Abstract

Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in isolation. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized using a pre-trained x-vector model, while the GNN model is initially trained using the x-vector embeddings from the pre-trained model. Finally, the E-SHARC model uses the front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach is capable of predicting the speakers in overlapping speech regions. The experimental evaluation on benchmark datasets such as AMI, VoxConverse, and DISPLACE illustrates that the proposed E-SHARC framework provides competitive diarization results using graph-based clustering methods.
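The hierarchical graph clustering idea described in the abstract can be illustrated with a minimal sketch. It is not the E-SHARC implementation: the learned GNN edge scores are replaced here by plain cosine similarity, cluster features are aggregated by a simple mean, and the function names (knn_graph, merge_level, hierarchical_cluster) as well as the default values of k and tau are hypothetical choices made only for illustration.

```python
import numpy as np

def knn_graph(emb, k):
    """Top-k cosine-similarity neighbours of every node (self excluded)."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)           # never pick a node as its own neighbour
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar nodes
    return nbrs, np.take_along_axis(sim, nbrs, axis=1)

def merge_level(emb, k, tau):
    """One clustering level: keep edges scoring above tau, return component labels."""
    n = len(emb)
    nbrs, sims = knn_graph(emb, k)
    parent = list(range(n))                  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j, s in zip(nbrs[i], sims[i]):
            if s > tau:                      # assumption: cosine stands in for a GNN edge score
                parent[find(i)] = find(int(j))
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

def hierarchical_cluster(emb, k=2, tau=0.5, max_levels=10):
    """Repeat graph building, merging, and mean aggregation until nothing merges."""
    labels = np.arange(len(emb))             # each segment starts as its own cluster
    nodes = np.asarray(emb, dtype=float)
    for _ in range(max_levels):
        if len(nodes) < 2:
            break
        level = merge_level(nodes, min(k, len(nodes) - 1), tau)
        n_clusters = level.max() + 1
        if n_clusters == len(nodes):         # no edge passed the threshold: stop
            break
        # Feature aggregation (assumed): unweighted mean of the merged node features.
        nodes = np.stack([nodes[level == c].mean(axis=0) for c in range(n_clusters)])
        labels = level[labels]               # propagate merges back to the segments
    return labels
```

Calling hierarchical_cluster on a matrix of segment embeddings (one row per segment) returns one integer cluster label per segment. In a supervised system of the kind the abstract describes, the thresholded cosine score would be replaced by edge probabilities predicted by the trained GNN.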

Paper Structure (33 sections, 7 equations, 6 figures, 6 tables)

Introduction
Related work
End-to-end neural diarization
Graph clustering algorithms
Overlap detection approaches
Proposed Approach
Background
Notations
Graph initialization
Forward pass
GNN scoring
Clustering
Feature aggregation
Graph generation
Model training
...and 18 more sections

Figures (6)

Figure 1: Block schematic of the E-SHARC algorithm with overlap handling. (a) shows E-SHARC inference containing ETDNN and GNN modules for the first speaker assignment. (b) shows the E-SHARC-Overlap approach for the second speaker assignment, using an external overlap detector and the GNN module. The arrows point to the k-nearest neighbors of a node after removing the intra-cluster edges.
Figure 2: Block schematic of the E-SHARC training. The ETDNN and GNN modules in black blocks contain learnable parameters. The GNN module generates edge probabilities and weights used in loss computation.
Figure 3: 2D t-SNE plot comparing x-vectors and GNN embeddings for two different recordings from the VoxConverse dev set. Ground truth labels are represented as shapes while predictions are represented as colors. In both cases, the proposed E-SHARC yields representations with improved separability. However, the DER for recording-2 deteriorated due to early stopping.
Figure 4: Plot comparing DER performance for $k$ ranging from 20 to 100 and $\tau \in \{0.0, 0.4, 0.8\}$ on the VoxConverse dev set.
Figure 5: Comparison of mean absolute error (MAE) on the speaker counting task for different values of $k$ and $\tau$ on the VoxConverse dev set.
...and 1 more figures
