Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift
Maja J. Hjuler, Line H. Clemmensen, Sneha Das
TL;DR
This work addresses explainability in speech emotion recognition (SER) under distribution shift by introducing EmoLIME, a LIME-based method that provides local, frequency-based explanations for end-to-end SER models. EmoLIME decomposes audio into spectral components and fits a local surrogate model with a locally weighted loss to assign an importance to each frequency band. It is evaluated on EMODB, RAVDESS, and IEMOCAP using classifiers built on hand-crafted ComParE features and wav2vec 2.0 embeddings. The findings indicate that, with deep features, low-frequency content strongly informs predictions for some emotions, while high-frequency content is tied to arousal. Explanations are more robust across models than across datasets, which points to distribution shift as a key challenge. The study also highlights the potential of combining EmoLIME with global explanations from gradient-based or SHAP methods to improve trust and interpretability in SER systems.
Abstract
We introduce EmoLIME, a version of local interpretable model-agnostic explanations (LIME) for black-box Speech Emotion Recognition (SER) models. To the best of our knowledge, this is the first attempt to apply LIME in SER. EmoLIME generates high-level interpretable explanations and identifies which specific frequency ranges are most influential in determining emotional states. The approach aids in interpreting complex, high-dimensional embeddings such as those generated by end-to-end speech models. We evaluate EmoLIME qualitatively, quantitatively, and statistically across three emotional speech datasets, using classifiers trained on both hand-crafted acoustic features and wav2vec 2.0 embeddings. We find that EmoLIME is more robust across different models than across datasets with distribution shifts, highlighting its potential for more consistent explanations in SER tasks within a dataset.
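To make the idea of a local, frequency-based explanation concrete, the following is a minimal sketch of a LIME-style procedure over frequency bands. It is not the authors' implementation. The function names (`band_masks`, `perturb`, `explain`, `toy_predict`), the equal-width band split, the exponential kernel on the fraction of removed bands, and the ridge-regularised linear surrogate are all assumptions chosen for illustration. The black box here is a toy scorer that responds only to energy below 500 Hz; a real SER model's class probability would take its place.

```python
import numpy as np

def band_masks(n_bins, n_bands):
    """Split rFFT bin indices into contiguous, roughly equal-width bands."""
    return np.array_split(np.arange(n_bins), n_bands)

def perturb(signal, bands, z):
    """Zero every spectral band whose entry in z is 0; return a waveform."""
    spec = np.fft.rfft(signal)
    for keep, idx in zip(z, bands):
        if not keep:
            spec[idx] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def explain(signal, predict_fn, n_bands=8, n_samples=200,
            kernel_width=0.5, ridge=1e-3, seed=0):
    """Return one importance weight per frequency band (assumed setup)."""
    rng = np.random.default_rng(seed)
    bands = band_masks(len(signal) // 2 + 1, n_bands)

    # Binary interpretable samples: 1 keeps a band, 0 removes it.
    Z = rng.integers(0, 2, size=(n_samples, n_bands))
    Z[0] = 1  # always include the unperturbed input

    # Query the black box on each perturbed waveform.
    y = np.array([predict_fn(perturb(signal, bands, z)) for z in Z])

    # Locality weights: samples closer to the original count more.
    dist = 1.0 - Z.mean(axis=1)  # fraction of bands removed
    w = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Weighted ridge regression with an unpenalised intercept.
    X = np.hstack([np.ones((n_samples, 1)), Z])
    reg = ridge * np.eye(n_bands + 1)
    reg[0, 0] = 0.0
    XtW = X.T * w  # same as X.T @ diag(w)
    coef = np.linalg.solve(XtW @ X + reg, XtW @ y)
    return coef[1:]  # drop the intercept

def toy_predict(x):
    """Toy black box: normalised spectral energy in the lowest 500 bins."""
    mag = np.abs(np.fft.rfft(x))
    return float(np.sum(mag[:500] ** 2) / len(x) ** 2)

# One second at 8 kHz: a 200 Hz tone plus a 3000 Hz tone.
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
importance = explain(signal, toy_predict)
print(np.round(importance, 3))
```

Because the toy scorer depends only on low-frequency energy, the surrogate should assign nearly all of the importance to the lowest band. With a real classifier, the same coefficients would indicate which frequency ranges push a single prediction toward or away from a given emotion.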
