On the calibration of powerset speaker diarization models

Alexis Plaquet; Hervé Bredin

TL;DR

Top-label confidence reliably predicts high-error regions. Training on low-confidence regions yields a better-calibrated model, and validating on low-confidence regions can be more annotation-efficient than validating on randomly selected regions.

Abstract

End-to-end neural diarization models have usually relied on a multilabel classification formulation of the speaker diarization problem. Recently, we proposed a powerset multiclass formulation that has outperformed the state of the art on multiple datasets. In this paper, we study the calibration of a powerset speaker diarization model and explore some of its uses. We study calibration both in-domain and out-of-domain, and explore the data in low-confidence regions. The reliability of model confidence is then tested in practice: we use the confidence of the pretrained model to selectively create training and validation subsets out of unannotated data, and compare this to random selection. We find that top-label confidence can be used to reliably predict high-error regions. Moreover, training on low-confidence regions provides a better-calibrated model, and validating on low-confidence regions can be more annotation-efficient than validating on random regions.
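
To make the notions of top-label confidence and calibration error concrete, here is a minimal sketch of the standard expected calibration error (ECE) with equal-width confidence bins. It is an illustration under assumptions, not the authors' implementation: the function name `expected_calibration_error`, the choice of 10 bins, and the input layout (one row of powerset class probabilities per frame) are hypothetical, and the paper's exact binning may differ.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE with equal-width bins (hypothetical helper).

    probs:  (n_frames, n_classes) per-frame class probabilities
    labels: (n_frames,) integer index of the reference class
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    # Top-label confidence: probability of the predicted (argmax) class.
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)

    # Interior bin edges; np.digitize maps each confidence to a bin in [0, n_bins - 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between accuracy and mean confidence, weighted by bin occupancy.
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A perfectly calibrated model has an accuracy equal to its mean confidence in every bin, which gives an ECE of zero.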

Paper Structure (10 sections, 2 equations, 6 figures)

This paper contains 10 sections, 2 equations, 6 figures.

Introduction
Model calibration
Metrics
In-domain and out-of-domain calibration
Analysis of low-confidence regions
Annotation-efficient domain adaptation
Finding a minimal training subset
Finding a minimal validation subset
Conclusion
Acknowledgements

Figures (6)

Figure 1: Calibration error as a function of the DER of the powerset segmentation model. In-domain datasets (top figure) are plotted with blue circles; out-of-domain datasets (bottom figure) are plotted with red diamonds. As a frame of reference, the blue circles of the in-domain datasets are also overlaid semi-transparently on the bottom figure.
Figure 2: Best and worst in-domain calibration, measured by ECE. The left column is a classical reliability diagram: a perfect ECE would mean no "difference to mean confidence" in any bin, resulting in a diagonal plot. The right column is the same plot but with DER instead of classification error.
Figure 3: Reliability diagram and binwise DER distributions for the best and worst calibrated domains in DIHARD.
Figure 4: Composition of the diarization error rate when sampling 5-second chunks, with the lowest-confidence chunks selected first. The dashed lines show the composition of the DER on the whole test set.
Figure 5: Composition of the data, categorized into nonspeech, speech, and overlap, when sampling 5-second chunks as in Figure 4. Dashed lines show the average distribution on the whole test set.
...and 1 more figure
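
The captions of Figures 4 and 5 describe sampling fixed-length chunks with the lowest-confidence chunks selected first. A minimal sketch of such a selection follows. It rests on assumptions that the source does not confirm: the function name `select_low_confidence_chunks` is hypothetical, chunks are taken as non-overlapping, and a chunk's confidence is taken to be the mean of its per-frame top-label confidences.

```python
import numpy as np


def select_low_confidence_chunks(frame_confidence, frames_per_chunk, n_chunks):
    """Return start frames of the n_chunks lowest-confidence chunks (hypothetical helper).

    frame_confidence: 1-D sequence of per-frame top-label confidences
    frames_per_chunk: number of frames in one chunk (e.g. 5 seconds of frames)
    """
    conf = np.asarray(frame_confidence, dtype=float)

    # Assumption: non-overlapping chunks; trailing frames that do not fill a chunk are dropped.
    n_total = len(conf) // frames_per_chunk
    chunks = conf[: n_total * frames_per_chunk].reshape(n_total, frames_per_chunk)

    # Assumption: chunk confidence is the mean frame confidence.
    mean_conf = chunks.mean(axis=1)

    # Lowest-confidence chunks first; a stable sort keeps ties in temporal order.
    order = np.argsort(mean_conf, kind="stable")[:n_chunks]
    return [int(i) * frames_per_chunk for i in order]
```

Replacing the sorted order with a random permutation of chunk indices would give the random-selection baseline that the abstract compares against.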
