Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions

Vikramjit Mitra, Amrit Romana, Dung T. Tran, Erdrin Azemi

TL;DR

This work tackles label uncertainty in spontaneous speech emotion by training on the full distribution of the graders' emotion labels (a probability density function target) rather than on a single consensus label. It introduces a saliency-driven layer-selection approach for foundation-model representations to improve both dimensional and categorical emotion recognition, evaluated on MSP-Podcast with cross-corpus robustness checks. The models reach state-of-the-art performance but show limited speaker generalization under 1-best predictions; this improves when the 2-best or 3-best hypotheses are considered, underscoring the impact of data skew and ambiguous emotions. Overall, the method advances robust, speaker-aware emotion modeling under varying acoustic conditions and noisy data, while highlighting important gaps in cross-speaker generalization and fairness across genders.

Abstract

Spontaneous speech emotion data usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce uncertainty in the labels because grader opinions vary. Grader variation is typically addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades fail to account for ambiguous instances in which a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observed that focusing on overall test-set performance can be deceiving, as it fails to reveal the model's generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, and that instead of using only the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
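The difference between a consensus target and a distribution target, and between 1-best and 2-best scoring, can be illustrated with a small sketch. This is not the authors' implementation: the four-emotion label set, the vote list, the logits, and all function names below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical label set, for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def votes_to_distribution(votes, labels=EMOTIONS):
    """Normalize grader votes into a distribution over the label set."""
    counts = Counter(votes)
    total = len(votes)
    return [counts[label] / total for label in labels]

def consensus_label(votes, labels=EMOTIONS):
    """Majority-vote label; ties resolve to the first label in `labels`."""
    dist = votes_to_distribution(votes, labels)
    return labels[dist.index(max(dist))]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def soft_cross_entropy(logits, target):
    """Cross-entropy against a full target distribution (soft labels)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))

def in_top_k(logits, true_index, k):
    """True if the reference class is among the k highest-scoring classes."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return true_index in ranked[:k]

# Five hypothetical graders disagree on one utterance.
votes = ["happy", "happy", "neutral", "happy", "neutral"]
target = votes_to_distribution(votes)   # keeps the happy/neutral ambiguity
hard = consensus_label(votes)           # discards the two "neutral" votes

# Hypothetical model scores that slightly prefer "neutral".
logits = [0.1, 1.0, 1.2, -0.5]
loss = soft_cross_entropy(logits, target)
true_index = EMOTIONS.index(hard)
hit_1best = in_top_k(logits, true_index, 1)
hit_2best = in_top_k(logits, true_index, 2)
```

In this example the consensus label is "happy" while the model ranks "neutral" first, so the utterance counts as an error under 1-best scoring but as a hit under 2-best scoring, even though two of the five graders also chose "neutral".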

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Multi-task emotion recognition model
  • Figure 2: Dimensional emotion estimation for different transformer layers in WavLM
  • Figure 3: WavLM layer saliency by valence, happy and angry emotion
  • Figure 4: Speaker-level performance (UAR from Whisper TC-GRU) plotted against emotion distributions, for speakers in Eval1.6.
  • Figure 5: Confusion matrices showing the relationship between 1st and 2nd best model hypotheses from Whisper TC-GRU and the Eval1.6 test set.
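Figure 4 reports speaker-level UAR (unweighted average recall), which is the mean of the per-class recalls and therefore weights rare emotions as heavily as frequent ones. A minimal sketch of computing it per speaker follows; the function names and the toy labels are assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recall over classes in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def uar_by_speaker(speakers, y_true, y_pred):
    """Group utterances by speaker ID and compute UAR within each group."""
    grouped = defaultdict(lambda: ([], []))
    for s, t, p in zip(speakers, y_true, y_pred):
        grouped[s][0].append(t)
        grouped[s][1].append(p)
    return {s: uar(t, p) for s, (t, p) in grouped.items()}

# Toy, skewed data: a model that always predicts "happy".
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy"]
overall = uar(y_true, y_pred)  # accuracy is 0.75, UAR is 0.5

per_speaker = uar_by_speaker(["a", "a", "b", "b"], y_true, y_pred)
```

On the toy data, plain accuracy is 0.75 while UAR is 0.5, which shows how a skewed emotion distribution can make an aggregate score look better than the per-class behaviour warrants.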