Uncertainty Quantification in Machine Learning for Joint Speaker Diarization and Identification

Simon W. McKnight; Aidan O. T. Hogg; Vincent W. Neo; Patrick A. Naylor

TL;DR

The paper tackles joint speaker diarization and identification (JSID) using two feature families: modulation spectrum features $\mathbf{\Phi}$ and MFCCs $\mathbf{\Psi}$. A CNN is applied to $\mathbf{\Phi}$ and an LSTM to $\mathbf{\Psi}$, with their outputs fused for final classification. Two kinds of uncertainty, aleatoric and epistemic, are modeled via probabilistic layers and Monte Carlo dropout. Across two experiments, the paper shows that fusing $\mathbf{\Phi}$ and $\mathbf{\Psi}$ substantially improves diarization error rates compared with using either feature alone, and that total uncertainty quantification can further improve performance, especially when combined with Kalman filter smoothing and model ensembles, and notably in challenging overlapping-speech scenarios. The study also provides frame-level entropy diagnostics that reveal when predictions are uncertain, offering a path toward more reliable and interpretable JSID systems with potential benefits for online deployment. Overall, the work highlights the complementary information in modulation spectrum and MFCC features and shows how uncertainty-aware fusion and resegmentation strategies can yield practical gains for speaker diarization and identification.

Abstract

This paper studies modulation spectrum features ($\mathbf{\Phi}$) and mel-frequency cepstral coefficients ($\mathbf{\Psi}$) in joint speaker diarization and identification (JSID). JSID is important because speaker diarization on its own, which only distinguishes speakers, is insufficient for many applications: it is often necessary to identify the speakers as well. Machine learning models are set up using convolutional neural networks (CNNs) on $\mathbf{\Phi}$ and long short-term memory (LSTM) recurrent neural networks on $\mathbf{\Psi}$, whose outputs are concatenated into fully connected layers. Experiment 1 shows that models on both $\mathbf{\Phi}$ and $\mathbf{\Psi}$ have better diarization error rates (DERs) than models on either alone: a CNN on $\mathbf{\Phi}$ has a DER of 29.09%, compared to 27.78% for an LSTM on $\mathbf{\Psi}$ and 19.44% for a model on both. Experiment 1 also investigates aleatoric uncertainties and shows that the model on both $\mathbf{\Phi}$ and $\mathbf{\Psi}$ has a mean entropy of 0.927 bits (out of 4 bits) for correct predictions compared to 1.896 bits for incorrect predictions. Together with the shapes of the entropy histograms, this shows that the model helpfully indicates where it is uncertain. Experiment 2 investigates epistemic as well as aleatoric uncertainties using Monte Carlo dropout (MCD). It compares models on both $\mathbf{\Phi}$ and $\mathbf{\Psi}$ with models trained on x-vectors ($X$), before applying Kalman filter smoothing on epistemic uncertainties for resegmentation and model ensembles. While the two models on $X$ (DERs 10.23% and 9.74%) outperform those on $\mathbf{\Phi}$ and $\mathbf{\Psi}$ (DER 17.85%) after their individual Kalman filter smoothing, combining them using a Kalman filter smoothing method improves the DER to 9.29%. Aleatoric uncertainties are higher for incorrect predictions. Both experiments show that models on $\mathbf{\Phi}$ do not distinguish overlapping speakers as well as anticipated. However, Experiment 2 shows that model ensembles do better with overlapping speakers than individual models do.
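To make the entropy figures above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the 4-bit maximum arises from four independent per-speaker binary outputs (each contributing at most 1 bit), and that MCD yields several stochastic forward passes per frame. The function names, array shapes, and sample data are hypothetical; the 2.5th to 97.5th percentile range mirrors the range mentioned in the Figure 2 caption.

```python
import numpy as np

def binary_entropy_bits(p, eps=1e-12):
    """Entropy in bits of Bernoulli probabilities p (elementwise)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def frame_entropy_bits(p_speakers):
    """Sum per-speaker entropies over the last axis (assumed: speakers)."""
    return binary_entropy_bits(p_speakers).sum(axis=-1)

def mcd_summary(samples):
    """Summarise stochastic passes of shape (passes, speakers).

    Returns the mean probability per speaker and the 2.5th and 97.5th
    percentiles across passes, used here as an epistemic range.
    """
    samples = np.asarray(samples, dtype=float)
    mean_p = samples.mean(axis=0)
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return mean_p, lower, upper

# Hypothetical data: 100 dropout passes for one frame with 4 speakers.
rng = np.random.default_rng(0)
samples = np.clip(
    rng.normal([0.9, 0.5, 0.1, 0.05], 0.05, size=(100, 4)), 0.0, 1.0
)

mean_p, lower, upper = mcd_summary(samples)
aleatoric_bits = frame_entropy_bits(mean_p)  # entropy of the mean prediction
epistemic_width = upper - lower              # spread across dropout passes
```

Under this assumption, a frame where every speaker has probability 0.5 reaches the 4-bit maximum, while confident predictions sit close to 0 bits, which is the sense in which a lower mean entropy for correct predictions indicates a useful uncertainty signal.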

Paper Structure (22 sections, 12 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Modulation Spectrum Background
Uncertainty Quantification Background
Resegmentation
This Research
Analysis
Generating Modulation Spectrum Features $\mathbf{\Phi}$
Uncertainty Quantification
Resegmentation
Scoring Metrics
Experimental Design and Results
Experiment Structure
Datasets and Ground Truth Labels
Features and Systems Used
Generating Features
...and 7 more sections

Figures (7)

Figure 1: Probability calibration graph on validation set for total uncertainties of MCD2-$\bm{\mathcal{X}}_R$ (defined in the section "Features and Systems Used").
Figure 2: 30 s extract of ES2008a for MCD1-$\mathbf{\Phi\Psi}$ before resegmentation: (a) aleatoric uncertainty from the mean $\bar{p}_{l, s}$ and epistemic uncertainty as the 2.5th to 97.5th percentile range; (b) total uncertainty from fitting truncated Gaussian $\phi_{l, s}$; (c) mean prediction $\bar{y}_{l, s}$; (d) modal prediction $\tilde{y}_{l, s}$; and (e) ground truth. (d) and (e) are offset slightly on the y-axis to clarify overlaps. The shaded regions in (a) and (b) show the epistemic uncertainty ranges.
Figure 3: Experiment 1 MCD1-$\mathbf{\Phi\Psi}$ entropy histograms for correct and incorrect modulation frame predictions, broken down by the actual number of speakers in those modulation frames (none had 4 speakers).
Figure 4: Frame-based errors for threshold in 0.1% increments.
Figure 5: Experiment 1 MCD1-$\mathbf{\Phi\Psi}$ entropy histograms for correct and incorrect predictions of modulation frames.
...and 2 more figures
