Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets

Zijun Jia; Jinsong Yu; Hongyu Long; Diyin Tang

TL;DR

This work tackles the unreliability of traditional speech emotion recognition (SER) in safety-critical settings by introducing a coverage-guaranteed, calibration-based framework built on Conformal Prediction. By combining Split Conformal Prediction with risk-controlled conformal prediction and a mini-batch online calibration scheme for non-exchangeable data, the approach delivers prediction sets that contain the true emotion label with probability at least $1-\alpha$ while allowing task-specific loss control. The methods are validated on IEMOCAP and TESS using Mel-spectrogram features and multiple CNN backbones, showing robust coverage across datasets and resilience to distribution shifts, including non-exchangeable online scenarios. The work further introduces the average prediction set size (APSS) as a practical uncertainty metric and demonstrates that smaller prediction sets at higher risk levels correlate with reduced uncertainty, supporting real-time deployment in dynamic environments.
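The split conformal step summarized above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the nonconformity score (one minus the softmax probability of the true label), the helper names `calibrate_threshold`, `prediction_sets`, and `coverage_and_apss`, and the synthetic four-class data standing in for a CNN's outputs on emotion classes.

```python
import numpy as np


def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Conformal quantile of the scores 1 - p(true label) on the calibration set."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: every label must be included.
        return np.inf
    return np.sort(scores)[k - 1]


def prediction_sets(probs, q_hat):
    """Boolean mask (n_samples, n_classes); True means the label is in the set."""
    return (1.0 - probs) <= q_hat


def coverage_and_apss(sets, labels):
    """Empirical coverage and average prediction set size (APSS)."""
    covered = sets[np.arange(len(labels)), labels]
    return covered.mean(), sets.sum(axis=1).mean()


# Synthetic stand-in for classifier outputs (assumption: 4 classes, labels
# sampled from the predicted distribution itself).
rng = np.random.default_rng(0)
n_classes = 4


def synthetic_split(n):
    logits = rng.normal(size=(n, n_classes))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = np.array([rng.choice(n_classes, p=p) for p in probs])
    return probs, labels


cal_probs, cal_labels = synthetic_split(1000)
test_probs, test_labels = synthetic_split(1000)

alpha = 0.1
q_hat = calibrate_threshold(cal_probs, cal_labels, alpha)
sets = prediction_sets(test_probs, q_hat)
coverage, apss = coverage_and_apss(sets, test_labels)
print(f"target coverage: {1 - alpha:.2f}, empirical: {coverage:.3f}, APSS: {apss:.2f}")
```

The guarantee is marginal: averaged over random calibration and test draws, coverage is at least $1-\alpha$, so a single run can land slightly above or below the target. Raising $\alpha$ lowers the threshold and shrinks the sets, which is the trade-off that APSS is meant to quantify.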

Abstract

Road rage, often triggered by emotional suppression and sudden outbursts, significantly threatens road safety by causing collisions and aggressive behavior. Speech emotion recognition (SER) technologies can mitigate this risk by identifying negative emotions early and issuing timely alerts. However, current SER methods, such as those based on hidden Markov models and long short-term memory networks, primarily handle one-dimensional signals, frequently overfit, and lack calibration, which limits their effectiveness in safety-critical settings. We propose a novel risk-controlled prediction framework that provides statistically rigorous guarantees on prediction accuracy. This approach uses a calibration set to define a binary loss function indicating whether the true label is included in the prediction set. Using a data-driven threshold $\beta$, we optimize a joint loss function so that the expected test loss is bounded by a user-specified risk level $\alpha$. Evaluations across six baseline models and two benchmark datasets demonstrate that our framework consistently achieves a minimum coverage of $1 - \alpha$, effectively controlling marginal error rates across varying calibration-test split ratios (e.g., 0.1). The robustness and generalizability of the framework are further validated through an extension to small-batch online calibration under a local exchangeability assumption. We construct a non-negative test martingale to maintain prediction validity even in dynamic and non-exchangeable environments. Cross-dataset tests confirm our method's ability to uphold reliable statistical guarantees in realistic, evolving data scenarios.
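For reference, the risk-control condition described in the abstract can be written in the standard conformal risk control form. This is a sketch under stated assumptions, not the paper's exact formulation: the nested prediction sets $C_\beta$ (assumed to grow as $\beta$ increases), the calibration size $n$, and the empirical risk $\hat{R}_n$ are notation introduced here, and the paper's joint loss may differ in detail.

```latex
% Binary miscoverage loss on a labeled example (x, y)
\ell\bigl(C_\beta(x), y\bigr) = \mathbb{1}\{\, y \notin C_\beta(x) \,\}

% Empirical risk on the n calibration examples
\hat{R}_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(C_\beta(x_i), y_i\bigr)

% Data-driven threshold: smallest beta whose inflated empirical risk is at most alpha
% (the 1/(n+1) term uses the fact that the loss is bounded by 1)
\hat{\beta} = \inf\Bigl\{ \beta : \frac{n}{n+1}\,\hat{R}_n(\beta) + \frac{1}{n+1} \le \alpha \Bigr\}

% Resulting guarantee for an exchangeable test example (X_{n+1}, Y_{n+1})
\mathbb{E}\bigl[\ell\bigl(C_{\hat{\beta}}(X_{n+1}), Y_{n+1}\bigr)\bigr] \le \alpha
\quad\Longleftrightarrow\quad
\mathbb{P}\bigl(Y_{n+1} \in C_{\hat{\beta}}(X_{n+1})\bigr) \ge 1 - \alpha
```

With the binary loss, the expected test loss is the miscoverage probability, so bounding it by $\alpha$ is the same statement as the $1 - \alpha$ coverage guarantee.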
