Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization

Nikhil Raghav, Md Sahidullah

TL;DR

This paper investigates the robustness of spectral clustering for deep speaker diarization under domain mismatch by performing same-domain and cross-domain evaluations on AMI and DIHARD-III. It employs a SpeechBrain-based speaker diarization (SD) pipeline with ECAPA-TDNN embeddings and a spectral clustering step that uses a pruning parameter $\alpha$ and an eigendecomposition to estimate the number of speakers $k$, followed by $k$-means clustering. The study shows that the optimal $\alpha$ and the accuracy of speaker-count estimation are highly sensitive to dataset domain, and that cross-domain tuning can either help or hinder performance depending on the domain, highlighting limitations of the approach in difficult domains. The findings provide practical guidance for cross-domain diarization and point to future work on adaptive parameter estimation and alternative embeddings to improve robustness.
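The summary above says only that an eigendecomposition is used to estimate the number of speakers $k$. As a rough illustration, the sketch below uses the maximum-eigengap heuristic on the unnormalized Laplacian, which is one common way to do this and is an assumption here, not necessarily the exact rule used in the paper. The function name `estimate_num_speakers` and the `max_speakers` cap are hypothetical.

```python
import numpy as np

def estimate_num_speakers(affinity, max_speakers=10):
    """Guess k as the position of the largest gap between consecutive
    Laplacian eigenvalues (eigengap heuristic; an assumed rule)."""
    # Symmetrize so the eigenvalues are real.
    sym = 0.5 * (affinity + affinity.T)
    # Unnormalized Laplacian: degree matrix minus affinity.
    laplacian = np.diag(sym.sum(axis=1)) - sym
    # eigvalsh returns eigenvalues in ascending order.
    eigvals = np.linalg.eigvalsh(laplacian)
    limit = min(max_speakers, len(eigvals) - 1)
    # gaps[i] is the jump from the i-th to the (i+1)-th smallest eigenvalue.
    gaps = np.diff(eigvals[: limit + 1])
    return int(np.argmax(gaps)) + 1

# Toy affinity with two perfectly separated groups of three segments each.
block = np.ones((3, 3))
toy_affinity = np.block([[block, np.zeros((3, 3))],
                         [np.zeros((3, 3)), block]])
estimated_k = estimate_num_speakers(toy_affinity)
```

On real recordings the gaps are far less clean than in this toy case, which is consistent with the summary's point that speaker-count estimation is sensitive to the domain.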

Abstract

Clustering speaker embeddings is crucial in speaker diarization but has not received as much attention as other components. Moreover, the robustness of speaker diarization across datasets has not been explored for the case where the development and evaluation data come from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between the two domain conditions can be attributed to the role of spectral clustering. In particular, keeping other modules unchanged, we show that differences in the optimal tuning parameters, as well as in speaker count estimation, originate from the mismatch. This study opens several future directions for speaker diarization research.

Paper Structure (16 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Illustration of the steps involved in spectral clustering. The method begins by extracting speaker embeddings for speech segments. It first calculates the affinity matrix from the $N$ input embeddings in a $d$-dimensional space, using cosine similarity as the similarity measure. In each row of the affinity matrix, smaller values are pruned according to a tuning parameter $\alpha$. Following symmetrization, it computes an unnormalized Laplacian matrix using the degree matrix $\mathbf{D}$. Subsequently, it applies singular value decomposition (SVD) to the Laplacian matrix $\mathbf{W}$ to derive the $k$ leading eigenvectors. The rows of the eigenvector matrix $\mathbf{U}$ serve as the $k$-dimensional spectral embeddings. Lastly, it performs the standard $k$-means algorithm to cluster these embeddings.
  • Figure 2: Plot showing the impact of the tuning parameter $\alpha$ on DER for the three microphone types in AMI development subsets. The circles represent the lowest DERs corresponding to each condition. [Best view in color.]
  • Figure 3: Plot showing the impact of the tuning parameter $\alpha$ on DER for the seven domains in DIHARD III development subsets. The circles represent the lowest DERs corresponding to each condition. [Best view in color.]
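The steps described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's or SpeechBrain's implementation: $\alpha$ is assumed to be the fraction of entries kept in each affinity row, the "leading" eigenvectors are taken to be those with the smallest Laplacian eigenvalues, a symmetric eigendecomposition stands in for the SVD, and the small `kmeans` helper with farthest-point initialization is a stand-in for a standard $k$-means routine. All function names are hypothetical.

```python
import numpy as np

def cosine_affinity(embeddings):
    """N x d embeddings -> N x N cosine-similarity affinity matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def prune_rows(affinity, alpha):
    """Zero the smallest values of each row, keeping roughly an alpha
    fraction of the entries (assumed meaning of alpha)."""
    pruned = affinity.copy()
    n_drop = int((1.0 - alpha) * affinity.shape[0])
    for row in pruned:  # each row is a view, so this edits `pruned` in place
        row[np.argsort(row)[:n_drop]] = 0.0
    return pruned

def spectral_embeddings(affinity, k):
    """Symmetrize, build the unnormalized Laplacian, return k eigenvectors."""
    sym = 0.5 * (affinity + affinity.T)
    laplacian = np.diag(sym.sum(axis=1)) - sym
    # eigh sorts eigenvalues in ascending order; take the k smallest.
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, :k]

def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist_to_centers = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dist_to_centers, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_cluster(embeddings, k, alpha=0.5):
    """Affinity -> pruning -> Laplacian eigenvectors -> k-means labels."""
    affinity = prune_rows(cosine_affinity(embeddings), alpha)
    return kmeans(spectral_embeddings(affinity, k), k)

# Toy data: two synthetic "speakers" pointing along different axes.
rng = np.random.default_rng(0)
speaker_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((5, 3))
speaker_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((5, 3))
labels = spectral_cluster(np.vstack([speaker_a, speaker_b]), k=2, alpha=0.5)
```

Sweeping `alpha` over a grid and scoring each run on development data mirrors the kind of tuning curves shown in Figures 2 and 3, although the scoring metric (DER) is not computed in this sketch.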