End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization
Prachi Singh, Sriram Ganapathy
TL;DR
This work addresses speaker diarization by introducing E-SHARC, an end-to-end supervised hierarchical graph clustering framework that jointly learns embeddings and graph-based clustering via a GNN. The embedding extractor is initialized from a pre-trained x-vector model and takes mel-filterbank features as input. E-SHARC extends SHARC with end-to-end optimization, and the E-SHARC-Overlap variant handles overlapped speech through a two-pass process guided by overlap detection. Empirical results on AMI, VoxConverse, and DISPLACE show DER improvements over traditional baselines and competitive performance against recent state-of-the-art methods, with additional gains when using VBx resegmentation. The paper also demonstrates representation improvements and provides ablations on hyperparameters, architectures, and overlap handling.
Abstract
Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in an isolated fashion. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized using a pre-trained x-vector model, while the GNN model is initially trained using the x-vector embeddings from the pre-trained model. Finally, the E-SHARC model uses the front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach is capable of predicting the speakers in the overlapping speech regions. The experimental evaluation on benchmark datasets such as AMI, VoxConverse, and DISPLACE illustrates that the proposed E-SHARC framework provides competitive diarization results using graph-based clustering methods.
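
To make the idea of hierarchical graph clustering over segment embeddings concrete, the following is a minimal sketch and not the authors' implementation. It assumes that plain cosine similarity stands in for the edge scores a trained GNN would predict, and that clusters are represented by the mean of their members at the next level. The function names and the parameters `k` (neighbours per node) and `tau` (edge threshold) are hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_graph(emb, k):
    """Return the cosine-similarity matrix and each node's k nearest neighbours."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbour
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    return sim, nbrs


def merge_level(emb, k=2, tau=0.8):
    """One level: keep k-NN edges scoring above tau, then label connected components."""
    n = len(emb)
    sim, nbrs = knn_graph(emb, min(k, n - 1))
    parent = list(range(n))  # union-find forest over the nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in nbrs[i]:
            # Assumption: cosine similarity replaces a learned edge probability.
            if sim[i, j] > tau:
                parent[find(i)] = find(int(j))

    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels


def hierarchical_cluster(emb, k=2, tau=0.8, max_levels=5):
    """Repeat merge_level on cluster centroids until no further merges occur."""
    labels = np.arange(len(emb))  # each segment starts as its own cluster
    nodes = emb
    for _ in range(max_levels):
        if len(nodes) < 2:
            break
        level = merge_level(nodes, k, tau)
        n_clusters = level.max() + 1
        if n_clusters == len(nodes):  # nothing merged, so stop
            break
        labels = level[labels]  # propagate the merge to the original segments
        nodes = np.stack([nodes[level == c].mean(axis=0) for c in range(n_clusters)])
    return labels


# Toy usage: four 2-D "embeddings" forming two well-separated directions.
toy = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
toy_labels = hierarchical_cluster(toy)
```

In this sketch the stopping rule is simply "no edge survives the threshold", so the number of speakers is not fixed in advance. In the framework described above, the edge scores would come from the trained GNN rather than from a fixed similarity measure.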
