Probabilistic Fusion and Calibration of Neural Speaker Diarization Models

Juan Ignacio Alvarez-Trejos, Sergio A. Balanya, Daniel Ramos, Alicia Lozano-Diez

TL;DR

This work addresses calibration and fusion of End-to-End Neural Diarization (EEND) outputs by introducing a probability-level framework that supports both multilabel and powerset representations. It demonstrates that joint calibration in powerset space, combined with Fuse-then-Calibrate and Dynamic Logits fusion, yields the best DER improvements on CallHome, while providing reliable confidence estimates for downstream use. The study systematically compares unsupervised and supervised fusion methods, analyzes the impact of probability-space choices, and reveals that calibration quality and task performance do not always align, underscoring the need for task-aware calibration objectives. Overall, the paper establishes best practices for probability-level fusion and calibration in neural diarization, enabling more accurate and uncertainty-aware multi-speaker segmentation beyond hard-decision approaches.

Abstract

End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
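
To make the two output formulations concrete, the following minimal sketch converts between them for the two-speaker case. It is an illustration rather than the authors' implementation: the class names, the function names, and above all the assumption that the two speakers' activities are independent when moving from multilabel to powerset space are ours, not taken from the paper.

```python
from typing import Dict, Tuple

# Powerset classes for two speakers: silence, speaker 1 only,
# speaker 2 only, and overlapped speech. The names are hypothetical.
POWERSET_CLASSES = ("none", "spk1", "spk2", "both")


def multilabel_to_powerset(p1: float, p2: float) -> Dict[str, float]:
    """Map per-speaker activity probabilities to powerset class probabilities.

    Assumption (not from the paper): the two speakers are independent,
    so each class probability is a product of per-speaker terms.
    """
    return {
        "none": (1.0 - p1) * (1.0 - p2),
        "spk1": p1 * (1.0 - p2),
        "spk2": (1.0 - p1) * p2,
        "both": p1 * p2,
    }


def powerset_to_multilabel(q: Dict[str, float]) -> Tuple[float, float]:
    """Marginalize powerset probabilities back to per-speaker activities."""
    return q["spk1"] + q["both"], q["spk2"] + q["both"]


if __name__ == "__main__":
    q = multilabel_to_powerset(0.9, 0.2)
    print(q)
    print(powerset_to_multilabel(q))
```

The powerset vector sums to one over mutually exclusive classes, which is what lets a single calibrator act on both speakers jointly. In multilabel space each speaker's probability would be calibrated on its own.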

Paper Structure

This paper contains 28 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Post-hoc Calibration Process
  • Figure 2: Modular framework design with three independent configuration decisions (left) and an example horizontal processing pipeline (right). The framework allows independent selection of: (1) calibration-fusion ordering (C→F: Calibrate-then-Fuse or F→C: Fuse-then-Calibrate), (2) calibration probability space (Mult: Multilabel or Power: Powerset), and (3) fusion probability space. The example pipeline demonstrates the F→C strategy with fusion in multilabel space followed by calibration in powerset space.
  • Figure 3: DER (%) vs BCE comparison on Powerset space for different models and fusion methods.
  • Figure 4: DER (%) components for individual models and the best fusion method at different processing stages.
  • Figure 5: BCE for individual models and the best fusion method at different processing stages.
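
The example pipeline in the Figure 2 caption (Fuse-then-Calibrate, with fusion in multilabel space and calibration in powerset space) can be sketched for a single frame as follows. This is a rough sketch under stated assumptions, not the paper's method: plain probability averaging stands in for the fusion step, temperature scaling stands in for the calibrator, the temperature value is arbitrary rather than fitted, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def fuse_average(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in unsupervised fusion: average per-speaker probabilities
    across models for one frame (multilabel space)."""
    n_models = len(model_probs)
    n_speakers = len(model_probs[0])
    return [sum(m[s] for m in model_probs) / n_models for s in range(n_speakers)]


def to_powerset(p1: float, p2: float) -> List[float]:
    """Two-speaker powerset vector [none, spk1, spk2, both].
    Assumes independent speakers (our assumption, not the paper's)."""
    return [(1 - p1) * (1 - p2), p1 * (1 - p2), (1 - p1) * p2, p1 * p2]


def temperature_scale(probs: Sequence[float], temperature: float,
                      eps: float = 1e-12) -> List[float]:
    """Stand-in calibrator: divide log-probabilities by a temperature
    and renormalize with a numerically stable softmax."""
    logits = [math.log(max(p, eps)) / temperature for p in probs]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two hypothetical models, one frame, two speakers each.
    frame = [[0.9, 0.1], [0.7, 0.3]]
    fused = fuse_average(frame)                 # fuse first (multilabel space)
    powerset = to_powerset(fused[0], fused[1])  # move to powerset space
    calibrated = temperature_scale(powerset, temperature=2.0)  # then calibrate
    print(fused, powerset, calibrated)
```

In practice the temperature would be fitted on held-out data. A temperature above 1 flattens the distribution without changing which class is most probable, and only the single fused output needs a calibrator in this ordering.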