Probabilistic Fusion and Calibration of Neural Speaker Diarization Models
Juan Ignacio Alvarez-Trejos, Sergio A. Balanya, Daniel Ramos, Alicia Lozano-Diez
TL;DR
This work addresses calibration and fusion of End-to-End Neural Diarization (EEND) outputs by introducing a probability-level framework that supports both multilabel and powerset representations. It demonstrates that joint calibration in powerset space, combined with Fuse-then-Calibrate and Dynamic Logits fusion, yields the best DER improvements on CallHome, while providing reliable confidence estimates for downstream use. The study systematically compares unsupervised and supervised fusion methods, analyzes the impact of probability-space choices, and reveals that calibration quality and task performance do not always align, underscoring the need for task-aware calibration objectives. Overall, the paper establishes best practices for probability-level fusion and calibration in neural diarization, enabling more accurate and uncertainty-aware multi-speaker segmentation beyond hard-decision approaches.
Abstract
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
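To make the two output formulations and the Fuse-then-Calibrate ordering concrete, the following is a minimal sketch for the two-speaker case. It is not the authors' implementation, and several details are assumptions for illustration only: the function names are hypothetical, the multilabel-to-powerset mapping assumes the two speakers' activities are independent, fusion is taken to be a plain average of log-probabilities, calibration is reduced to a single temperature parameter, and speaker permutations are assumed to be already aligned across models.

```python
import numpy as np


def multilabel_to_powerset(p):
    """Map per-speaker activity probabilities of shape (T, 2) to powerset
    probabilities of shape (T, 4) over the classes
    [silence, speaker 1 only, speaker 2 only, overlap].
    Assumption: the two speakers are treated as independent."""
    p1, p2 = p[:, 0], p[:, 1]
    return np.stack(
        [(1 - p1) * (1 - p2), p1 * (1 - p2), (1 - p1) * p2, p1 * p2], axis=1
    )


def powerset_to_multilabel(q):
    """Marginalize powerset probabilities (T, 4) back to per-speaker
    activity probabilities (T, 2)."""
    return np.stack([q[:, 1] + q[:, 3], q[:, 2] + q[:, 3]], axis=1)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_then_calibrate(powerset_probs, temperature=1.0, eps=1e-8):
    """Fuse first, calibrate second: average the models' log-probabilities,
    then apply one temperature to the fused scores.
    Assumption: temperature scaling stands in for whatever calibrator is
    actually used; in practice the temperature would be fit on held-out data."""
    log_probs = [np.log(np.clip(q, eps, 1.0)) for q in powerset_probs]
    fused_logits = np.mean(log_probs, axis=0)
    return softmax(fused_logits / temperature)


# Two hypothetical models, two frames, two speakers (multilabel outputs).
model_a = np.array([[0.9, 0.2], [0.1, 0.1]])
model_b = np.array([[0.8, 0.4], [0.2, 0.3]])

fused = fuse_then_calibrate(
    [multilabel_to_powerset(model_a), multilabel_to_powerset(model_b)],
    temperature=1.5,
)
hard_decisions = fused.argmax(axis=1)  # powerset class index per frame
```

Because the fused output is a single distribution over powerset classes, only one calibrator has to be fit, which mirrors the point above that this ordering requires calibrating only a single combined model. A temperature above 1 softens the fused distribution without changing the per-frame argmax.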
