Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization
Nikhil Raghav, Md Sahidullah
TL;DR
This paper investigates the robustness of spectral clustering for deep speaker diarization (SD) under domain mismatch by performing same-domain and cross-domain evaluations on AMI and DIHARD-III. It employs a SpeechBrain-based SD pipeline with ECAPA-TDNN embeddings and a spectral clustering step that uses a pruning parameter $\alpha$ and an eigendecomposition to estimate the number of speakers $k$, followed by $k$-means clustering. The study shows that the optimal $\alpha$ and the accuracy of speaker-count estimation are highly sensitive to the dataset domain, and that cross-domain tuning can either help or hinder performance depending on the domain, highlighting limitations of the approach in difficult domains. The findings provide practical guidance for cross-domain diarization and point to future work on adaptive parameter estimation and alternative embeddings to improve robustness.
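The clustering step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' or SpeechBrain's implementation. It assumes a cosine-similarity affinity matrix, row-wise pruning that keeps the largest $\alpha$ fraction of entries, an unnormalized graph Laplacian, and a largest-eigengap heuristic for estimating $k$. The function names (`prune_affinity`, `estimate_num_speakers`, `kmeans`, `spectral_cluster`), the `max_k` cap, and the farthest-point $k$-means initialization are hypothetical choices made for illustration.

```python
import numpy as np


def prune_affinity(affinity, alpha):
    """Keep the largest `alpha` fraction of entries in each row, zero the rest,
    then symmetrize (assumed pruning scheme)."""
    n = affinity.shape[0]
    keep = max(1, int(round(alpha * n)))
    pruned = np.zeros_like(affinity)
    for i in range(n):
        top = np.argsort(affinity[i])[-keep:]  # indices of the largest entries
        pruned[i, top] = affinity[i, top]
    return 0.5 * (pruned + pruned.T)


def estimate_num_speakers(eigvals, max_k):
    """Largest-eigengap heuristic on ascending Laplacian eigenvalues (assumption)."""
    gaps = np.diff(eigvals[: max_k + 1])
    return int(np.argmax(gaps)) + 1


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def spectral_cluster(embeddings, alpha, max_k=10):
    """Return (estimated k, per-segment cluster labels) for speaker embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = prune_affinity(unit @ unit.T, alpha)          # cosine similarity
    laplacian = np.diag(affinity.sum(axis=1)) - affinity     # unnormalized L = D - A
    eigvals, eigvecs = np.linalg.eigh(laplacian)             # ascending eigenvalues
    k = estimate_num_speakers(eigvals, max_k)
    labels = kmeans(eigvecs[:, :k], k)                       # cluster spectral embeddings
    return k, labels
```

In this sketch, $\alpha$ controls how sparse the affinity graph becomes: a small value keeps only the strongest connections and tends to split the graph into more components, while a large value keeps weak links that can merge speakers. Both effects change the eigenvalue spectrum and therefore the estimated $k$, which is consistent with the sensitivity to $\alpha$ described above.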
Abstract
Clustering speaker embeddings is crucial in speaker diarization but has received less attention than other components. Moreover, the robustness of speaker diarization has not been explored for the case where the development and evaluation data come from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between the two domain conditions can be attributed to the role of spectral clustering. In particular, keeping other modules unchanged, we show that differences in the optimal tuning parameters as well as in speaker count estimation originate from the mismatch. This study opens several future directions for speaker diarization research.
