On the calibration of powerset speaker diarization models

Alexis Plaquet, Hervé Bredin

TL;DR

Top-label confidence reliably predicts high-error regions. Training on low-confidence regions yields a better-calibrated model, and validating on low-confidence regions can be more annotation-efficient than validating on randomly selected regions.

Abstract

End-to-end neural diarization models have usually relied on a multilabel classification formulation of the speaker diarization problem. Recently, we proposed a powerset multiclass formulation that has outperformed the state of the art on multiple datasets. In this paper, we study the calibration of a powerset speaker diarization model and explore some of its uses. We study the calibration in-domain as well as out-of-domain, and explore the data in low-confidence regions. The reliability of model confidence is then tested in practice: we use the confidence of the pretrained model to selectively create training and validation subsets out of unannotated data, and compare this to random selection. We find that top-label confidence can be used to reliably predict high-error regions. Moreover, training on low-confidence regions provides a better-calibrated model, and validating on low-confidence regions can be more annotation-efficient than validating on random regions.
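
As a rough illustration of the quantities the abstract refers to, the sketch below computes top-label confidence (the highest class probability in each frame) and the expected calibration error (ECE) in its standard equal-width-bin form. The function names, the ten-bin default, and the binning scheme are assumptions made for illustration; the paper's exact settings are not given in this excerpt.

```python
def top_label(frame_probs):
    """Return (confidence, predicted_class) for each frame.

    frame_probs: list of per-frame probability lists over the powerset classes.
    """
    result = []
    for probs in frame_probs:
        best = max(range(len(probs)), key=lambda k: probs[k])
        result.append((probs[best], best))
    return result


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard top-label ECE with equal-width confidence bins.

    confidences: top-label confidence per frame, each in [0, 1].
    correct: booleans, True where the predicted class matches the reference.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one frame is required")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += 1.0 if ok else 0.0
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece
```

A model is well calibrated when, within each bin, the fraction of correct frames is close to the mean confidence, which drives the ECE toward zero.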

Paper Structure (10 sections, 2 equations, 6 figures)

Figures (6)

  • Figure 1: Calibration error as a function of the DER of the powerset segmentation model. In-domain datasets (top figure) are plotted with blue circles; out-of-domain datasets (bottom figure) are plotted with red diamonds. As a frame of reference, the blue circles of the in-domain datasets are also overlaid semi-transparently on the bottom figure.
  • Figure 2: Best and worst in-domain calibration, measured by ECE. The left column is a classical reliability diagram: perfect calibration (an ECE of zero) would mean no "difference to mean confidence" in any bin, resulting in a diagonal plot. The right column is the same plot with DER in place of classification error.
  • Figure 3: Reliability diagram and binwise DER distributions for the best and worst calibrated domains in DIHARD.
  • Figure 4: Composition of the diarization error rate when sampling 5-second chunks, with the lowest-confidence chunks selected first. Dashed lines show the composition of the DER on the whole test set.
  • Figure 5: Composition of the data, categorized into nonspeech, speech, and overlap, when sampling 5-second chunks as in Figure 4. Dashed lines show the average distribution on the whole test set.
  • ...and 1 more figure
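
The chunk sampling described in the Figure 4 caption (fixed-length chunks, lowest confidence first) might be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, chunks are non-overlapping, and each chunk is scored by the mean of its per-frame top-label confidences. The excerpt does not say how the paper aggregates confidence over a chunk.

```python
def lowest_confidence_chunks(frame_conf, frames_per_chunk, n_chunks):
    """Return start indices of the n_chunks least confident chunks.

    frame_conf: per-frame top-label confidence for one recording.
    frames_per_chunk: number of frames in a chunk (e.g. 5 seconds' worth).
    Assumption: a chunk's score is the mean confidence of its frames.
    """
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")

    scored = []
    # Non-overlapping chunks; a trailing partial chunk is ignored.
    last_start = len(frame_conf) - frames_per_chunk
    for start in range(0, last_start + 1, frames_per_chunk):
        window = frame_conf[start:start + frames_per_chunk]
        scored.append((sum(window) / frames_per_chunk, start))

    # Ascending sort puts the lowest-confidence chunks first.
    scored.sort()
    return [start for _, start in scored[:n_chunks]]
```

The returned chunks would be the first candidates to send for annotation, whereas the random baseline mentioned in the abstract would draw start indices uniformly instead.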