Testing Correctness, Fairness, and Robustness of Speech Emotion Recognition Models

Anna Derington, Hagen Wierstorf, Ali Özkil, Florian Eyben, Felix Burkhardt, Björn W. Schuller

TL;DR

The paper tackles the problem that speech emotion recognition (SER) models with similar accuracy can exhibit divergent and potentially problematic behaviors. It introduces an offline, multi-faceted testing framework that categorizes model behavior into correctness, fairness, and robustness, and provides automatic methods to set fairness thresholds. The authors evaluate eleven acoustic foundation models and a CNN baseline on arousal, dominance, valence, and emotional categories using a large battery of 2,029 tests, uncovering that high performance can come with sentiment-based shortcuts and linguistic biases, as well as issues with cross-language and noise robustness. The proposed framework offers a practical toolkit for developers and researchers to diagnose, compare, and improve SER models before deployment, with open resources and detailed results available online.

Abstract

Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. These models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the same recall or correlation is achieved by the model. This paper introduces a testing framework to investigate the behaviour of speech emotion recognition models by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. It also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, and recommendations on how to select the remaining test thresholds. We evaluated one xLSTM-based and nine transformer-based acoustic foundation models against a convolutional baseline model, testing their performance on arousal, valence, dominance, and emotional category classification. The test results highlight that models with high correlation or recall might rely on shortcuts, such as text sentiment, and differ in terms of fairness.
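The idea of a test that passes once a metric reaches a threshold can be illustrated with a minimal sketch. It is not the authors' implementation: the class `BehaviourTest`, the function `pass_rate_by_category`, the test names, and all threshold and metric values below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviourTest:
    """One test: a metric value is compared against a fixed threshold."""
    name: str
    category: str  # "correctness", "fairness", or "robustness"
    threshold: float
    higher_is_better: bool = True  # False for metrics that should stay small

    def passed(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def pass_rate_by_category(results):
    """Return the percentage of passed tests per category.

    `results` is an iterable of (BehaviourTest, metric value) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for test, value in results:
        counts[test.category][0] += int(test.passed(value))
        counts[test.category][1] += 1
    return {cat: 100.0 * p / n for cat, (p, n) in counts.items()}


# Hypothetical thresholds and metric values, for illustration only.
results = [
    (BehaviourTest("correlation", "correctness", 0.5), 0.62),
    (BehaviourTest("mean shift between groups", "fairness", 0.075,
                   higher_is_better=False), 0.09),
    (BehaviourTest("prediction change under noise", "robustness", 0.05,
                   higher_is_better=False), 0.03),
]
print(pass_rate_by_category(results))
```

Reporting a pass rate per category, rather than a single score, keeps a model that does well on correctness from hiding failures in fairness or robustness.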

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Maximum difference in mean value among 1000 simulations with a varying number of groups and number of samples per group.
  • Figure 2: Percentage of passed tests, averaged over all presented models and shown with standard deviation, for all tests involving a dimensional emotion task. Corr stands for Correctness, Rob for Robustness, spk for speaker, backg for background, qual for quality, rec for recording, and cond for condition.
  • Figure 3: Predictions of hubert-L for arousal (left), dominance (centre), valence (right) on the ravdess test set, split by the categorical emotions the samples are annotated for in ravdess. The green area marks the region in which a dimensional prediction would be rated as consistent with the annotated emotional category by the Correctness Consistency tests.
  • Figure 4: Confusion matrices for the prediction of emotional categories by hubert-L, comparing the clean msppodcast test set 1 with the same test set after adding coughing with an SNR of 10dB (left), sneezing with an SNR of 10dB (centre), or white noise with an SNR of 20dB (right).
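
A simulation of the kind described in the caption of Figure 1 might look roughly like the sketch below. It rests on assumptions that the caption does not state: every group draws its samples from the same standard normal distribution, and the "difference in mean value" is taken as the gap between the largest and the smallest group mean. The function name `max_mean_difference` and its parameters are hypothetical.

```python
import random
from statistics import fmean


def max_mean_difference(n_groups, n_samples, n_simulations=1000, seed=0):
    """Largest gap between group means seen across repeated simulations.

    All groups are drawn from the same distribution (an assumption),
    so any gap between their means is due to chance alone.
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(n_simulations):
        means = [
            fmean([rng.gauss(0.0, 1.0) for _ in range(n_samples)])
            for _ in range(n_groups)
        ]
        worst = max(worst, max(means) - min(means))
    return worst


# The chance-level gap shrinks as the number of samples per group grows.
for n_samples in (10, 100, 1000):
    gap = max_mean_difference(n_groups=2, n_samples=n_samples, n_simulations=200)
    print(n_samples, round(gap, 3))
```

Under these assumptions, small groups can differ noticeably in their mean by chance alone, which is one reason a fairness threshold might be chosen with the number of groups and samples per group in mind.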