Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift

Maja J. Hjuler, Line H. Clemmensen, Sneha Das

TL;DR

This work addresses explainability in speech emotion recognition (SER) under distribution shift by introducing EmoLIME, a LIME-based method that provides local, frequency-based explanations for end-to-end SER models. It decomposes audio into spectral components and fits a local surrogate model with a locally weighted loss to assign an importance weight to each frequency band. The method is evaluated on EMODB, RAVDESS, and IEMOCAP using classifiers trained on hand-crafted ComParE features and Wav2Vec 2.0 embeddings. The findings indicate that low-frequency content strongly informs predictions for some emotions when deep features are used, while high-frequency content is tied to arousal. Explanations are more robust across models than across datasets, which points to distribution shift as a key challenge. The study also highlights the potential of combining EmoLIME with global explanations from gradient-based or SHAP methods to improve trust and interpretability in SER systems.

Abstract

We introduce EmoLIME, a version of local interpretable model-agnostic explanations (LIME) for black-box Speech Emotion Recognition (SER) models. To the best of our knowledge, this is the first attempt to apply LIME in SER. EmoLIME generates high-level interpretable explanations and identifies which specific frequency ranges are most influential in determining emotional states. The approach aids in interpreting complex, high-dimensional embeddings such as those generated by end-to-end speech models. We evaluate EmoLIME qualitatively, quantitatively, and statistically across three emotional speech datasets, using classifiers trained on both hand-crafted acoustic features and Wav2Vec 2.0 embeddings. We find that EmoLIME is more robust across different models than across datasets with distribution shifts, highlighting its potential for more consistent explanations in SER tasks within a dataset.
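
The general idea of a LIME-style explanation over frequency bands can be illustrated with a small sketch. This is not the authors' implementation: the equal-width FFT band split, the random binary masks, the exponential kernel, the ridge surrogate, and all names (`spectral_components`, `explain_frequency_bands`, `toy_predict`) are assumptions chosen for illustration, and the "model" is a toy score rather than a real SER classifier.

```python
import numpy as np


def spectral_components(signal, n_bands):
    """Split a 1-D signal into band-limited components that sum back to it.

    Assumption: equal-width bands over the FFT bins; the paper may use a
    different decomposition.
    """
    spectrum = np.fft.rfft(signal)
    edges = np.linspace(0, len(spectrum), n_bands + 1).astype(int)
    components = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spectrum)
        band[lo:hi] = spectrum[lo:hi]  # keep only this band's bins
        components.append(np.fft.irfft(band, n=len(signal)))
    return np.stack(components)  # shape: (n_bands, n_samples)


def explain_frequency_bands(signal, predict_fn, n_bands=8, n_samples=500,
                            kernel_width=0.25, ridge=1e-3, seed=0):
    """Return one surrogate weight per frequency band for a black-box score.

    predict_fn is a hypothetical callable mapping a waveform to a scalar,
    e.g. the model's probability for the target emotion.
    """
    rng = np.random.default_rng(seed)
    comps = spectral_components(signal, n_bands)

    # Binary masks: 1 keeps a band, 0 removes it. Row 0 is the original input.
    masks = rng.integers(0, 2, size=(n_samples, n_bands))
    masks[0] = 1
    perturbed = masks @ comps  # each row is a perturbed waveform
    targets = np.array([predict_fn(x) for x in perturbed])

    # Locality weights: perturbations with fewer removed bands count more.
    distance = 1.0 - masks.mean(axis=1)  # fraction of bands removed
    sample_weights = np.exp(-(distance ** 2) / kernel_width ** 2)

    # Weighted ridge regression with an unpenalised intercept.
    X = np.hstack([np.ones((n_samples, 1)), masks])
    penalty = ridge * np.eye(n_bands + 1)
    penalty[0, 0] = 0.0
    A = X.T @ (sample_weights[:, None] * X) + penalty
    b = X.T @ (sample_weights * targets)
    coef = np.linalg.solve(A, b)
    return coef[1:]  # drop the intercept; one weight per band


# Toy stand-in for a black-box model: the share of the original
# upper-half-spectrum energy that survives in the perturbed waveform.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3200 * t)


def high_band_energy(x):
    power = np.abs(np.fft.rfft(x)) ** 2
    return power[len(power) // 2:].sum()


reference = high_band_energy(signal)


def toy_predict(x):
    return high_band_energy(x) / reference


weights = explain_frequency_bands(signal, toy_predict)
print(np.round(weights, 3))
```

In this toy setup the score depends only on the 3200 Hz tone, so the band that contains it (index 6 when eight equal-width bands cover 0 to 4000 Hz) should receive by far the largest weight, while the remaining bands stay close to zero. With a real SER model, `predict_fn` would wrap feature extraction and the classifier's probability for the emotion being explained.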

Paper Structure

This paper contains 6 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Functional block diagram of EmoLIME, inspired by Mishra et al. (2017) and audioLIME (Haunschmid et al., 2020).
  • Figure 2: Example explanations for the happy expression of a German sentence from EMODB. Components highlighted in green account for a true prediction. Weights are annotated in white. a) Higher weight is given to high-pitch sounds (high frequency) for wav2vec2-SVC. b) The same pattern cannot be recognized for the ComParE-SVC model.
  • Figure 3: Explanations for the angry expression of a German sentence from EMODB. a) More weight is given to low-pitch sounds (low frequency) for wav2vec2-SVC. b) Weights are more uniformly distributed for the ComParE-SVC model.
  • Figure 4: Comparison of spectral decomposition weights for the models based on ComParE (top) vs. deep features (bottom). The weights are computed as the mean across ten utterances per emotion and their standard deviations are illustrated with error bars. Positive component weights account for a prediction of the target emotion. In contrast, negatively weighted components lead the model to predict a different emotion.