Investigating Confidence Estimation Measures for Speaker Diarization

Anurag Chowdhury, Abhinav Misra, Mark C. Fuhs, Monika Woszczyna

TL;DR

This work addresses the propagation of diarization errors to downstream tasks by developing segment-level confidence measures that work with both white-box and black-box systems. It evaluates multiple embedding-based scoring methods, including a spectral clustering variant and silhouette-based approaches, on the AMI and DoPaCo datasets with ECAPA-TDNN/xVector and end-to-end (E2E) diarization pipelines. The findings show that silhouette-based confidence (and related embedding-based methods) consistently reduces the covered diarization error rate (cDER), isolating a significant fraction of errors within the lowest-confidence segments. The results demonstrate practical value for downstream data selection and potential overlap-aware improvements, enabling more reliable speaker labeling in challenging multi-speaker conversations.

Abstract

Speaker diarization systems segment a conversation recording based on the speakers' identity. Such systems can misclassify the speaker of a portion of audio due to a variety of factors, such as speech pattern variation, background noise, and overlapping speech. These errors propagate to, and can adversely affect, downstream systems that rely on the speaker's identity, such as speaker-adapted speech recognition. One of the ways to mitigate these errors is to provide segment-level diarization confidence scores to downstream systems. In this work, we investigate multiple methods for generating diarization confidence scores, including those derived from the original diarization system and those derived from an external model. Our experiments across multiple datasets and diarization systems demonstrate that the most competitive confidence score methods can isolate ~30% of the diarization errors within segments with the lowest ~10% of confidence scores.
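
To make the idea of a segment-level confidence score concrete, the following is a minimal sketch, not the authors' implementation. It assumes each segment already has a speaker embedding and a hypothesized speaker label, scores each segment with the standard silhouette coefficient under cosine distance, and then measures what share of the error time falls in the lowest-confidence segments. The function names, the toy embeddings, the per-segment error durations, and the choice to take the lowest 10% by segment count (rather than by duration) are all illustrative assumptions.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_confidences(embeddings, labels):
    """Per-segment silhouette coefficient s = (b - a) / max(a, b).

    a: mean distance to other segments with the same speaker label.
    b: smallest mean distance to the segments of any other speaker.
    """
    scores = []
    for i, (x, lab) in enumerate(zip(embeddings, labels)):
        dists = {}
        for j, (y, other) in enumerate(zip(embeddings, labels)):
            if i != j:
                dists.setdefault(other, []).append(cosine_distance(x, y))
        means = {k: sum(v) / len(v) for k, v in dists.items()}
        if lab not in means or len(means) < 2:
            # Singleton cluster or only one speaker: score 0 by convention.
            scores.append(0.0)
            continue
        a = means[lab]
        b = min(m for k, m in means.items() if k != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def error_fraction_in_lowest(confidences, error_durations, fraction=0.1):
    """Share of total error time found in the lowest-confidence segments."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    k = max(1, round(fraction * len(order)))  # assumption: select by count
    total = sum(error_durations)
    if total == 0:
        return 0.0
    return sum(error_durations[i] for i in order[:k]) / total


# Toy data (hypothetical): the last segment is labeled "A" but lies near "B".
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
              [0.0, 1.0], [0.1, 0.9], [0.4, 0.6]]
labels = ["A", "A", "A", "B", "B", "A"]
error_durations = [0.0, 0.0, 0.0, 0.0, 0.2, 1.8]  # seconds, made up

scores = silhouette_confidences(embeddings, labels)
isolated = error_fraction_in_lowest(scores, error_durations, fraction=0.1)
print(scores, isolated)
```

In this toy case the mislabeled segment receives a negative silhouette score, so it is the first one a downstream system would discard when thresholding on confidence.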

Paper Structure

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Spectral ratio (SR) vs. speaker error rate (SER), computed using ECAPA-TDNN speaker embeddings on the eval set of the AMI dataset. No correlation is observed between SR and SER, indicating that SR is a poor predictor of diarization confidence.
  • Figure 2: A visual representation of the proposed speaker diarization confidence assessment framework.
  • Figure 3: cDER vs. coverage plots using the ECAPA-TDNN-based diarization method. Large markers in the plots show the operating points at global thresholds.
  • Figure 4: A comparison of histograms of diarization confidence scores estimated using (a) spectral clustering-based, (b) end-to-end-based, and (c) proposed confidence assessment methods on the AMI dataset.
  • Figure 5: A visual representation of the Local Confidence estimation results on a doctor-patient conversation from the DoPaCo dataset. The proposed method assigns lower confidence to most overlapping speech segments.