Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition

Oliver Schrüfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn Schuller

TL;DR

This paper addresses the challenge of reliable uncertainty quantification in real‑world speech emotion recognition by evaluating four UQ approaches (Entropy, MC Dropout, Evidential Deep Learning, and Prior Networks) across diverse SER datasets and realistic distortions, including unknown emotions, non‑speech data, and corrupted signals. It demonstrates that simple entropy‑based UQ provides useful uncertainty signals, but reliability improves when UQ is integrated into the model via Prior Networks and when the model is trained with OOD data. The study finds that Prior Networks, especially when exposed to OOD data, offer the strongest separation between in‑domain and out‑of‑domain inputs, while MC Dropout and EDL show more limited or noise‑sensitive performance. The results offer practical guidance for deploying SER systems with calibrated uncertainty in real environments and highlight avenues for calibration and better separation of data versus distributional uncertainty. Overall, the work provides a first systematic, cross‑dataset assessment of UQ methods in SER under realistic challenges, informing future development of reliable, real‑world SER systems.
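The entropy-based approach named above can be illustrated with a minimal sketch. It assumes a classifier that outputs one logit per emotion class and takes the Shannon entropy of the softmax output as the uncertainty score. The four-class setup, the example logits, and the function names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Convert raw logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy scaled to [0, 1] by dividing by log(number of classes)."""
    return predictive_entropy(probs) / math.log(len(probs))

# Hypothetical logits for a 4-class emotion classifier (illustrative values).
confident = softmax([8.0, 0.5, 0.2, 0.1])   # one class clearly dominates
ambiguous = softmax([1.0, 1.0, 1.0, 1.0])   # all classes equally likely

print(normalized_entropy(confident))  # close to 0: low uncertainty
print(normalized_entropy(ambiguous))  # close to 1: maximal uncertainty
```

In a deployment along these lines, a prediction whose score exceeds a threshold chosen on validation data could be rejected rather than reported. The threshold itself is a design choice that this sketch does not cover.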

Abstract

Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool for identifying faulty predictions. Speech emotion recognition (SER) models can suffer from a particularly large number of sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Reliable UQ methods are thus of particular interest, as in many SER applications no prediction is better than a faulty prediction. While the effects of label ambiguity on uncertainty are well documented in the literature, we focus our work on an evaluation of UQ methods for SER under common challenges in real-world applications, such as corrupted signals and the absence of speech. We show that simple UQ methods can already give an indication of the uncertainty of a prediction and that training with additional OOD data can greatly improve the identification of such signals.

Paper Structure (22 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Simplexes of Dirichlet Distributions for a 3-class problem with sharp (a,b) and flat (c) distributions
  • Figure 2: Overview of the performance of the 5 UQ methods on different tests. Higher values on the $x$-axis (and on the $y$-axis in the bottom row) represent higher uncertainty in all cases. The first row shows CDF plots for correct and wrong predictions on the test data of all 3 SER datasets. The middle row shows CDF plots of the uncertainty on speech (blue) and non-speech (red + grey) datasets. The bottom row shows, for EmoDB test data augmented with noise at different SNR levels, the mean uncertainty with a 95% confidence interval (left axis) and the UAR (right axis).