Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues
Seham Nasr, Zhao Ren, David Johnson
TL;DR
The paper tackles the limited faithfulness of saliency-based explanations in speech emotion recognition (SER) by grounding saliency in expert-referenced acoustic cues. It introduces a framework that quantifies cue magnitudes within salient regions and links them to theory-driven emotion markers, using Occlusion Sensitivity and Concept Relevance Propagation to produce more interpretable explanations. Experiments on CREMA-D and TESSSP2 demonstrate that salient regions carry psychoacoustically meaningful cues, with high-arousal emotions showing stronger, more variable markers and misclassifications revealing implausible cue alignments, indicating improved explanation reliability. This approach advances trustworthy, theory-aligned XAI for speech affective computing and can inform broader applications in speech event detection and pathology monitoring.
Abstract
Explainable AI (XAI) for Speech Emotion Recognition (SER) is critical for building transparent, trustworthy models. Current saliency-based methods, adapted from vision, highlight spectrogram regions but fail to show whether these regions correspond to meaningful acoustic markers of emotion, limiting faithfulness and interpretability. We propose a framework that overcomes these limitations by quantifying the magnitudes of cues within salient regions. This clarifies "what" is highlighted and connects it to "why" it matters, linking saliency to expert-referenced acoustic cues of speech emotions. Experiments on benchmark SER datasets show that our approach improves explanation quality by explicitly linking salient regions to theory-driven, expert-referenced acoustic cues of speech emotion. Compared to standard saliency methods, it provides more understandable and plausible explanations of SER models, offering a foundational step towards trustworthy speech-based affective computing.
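To make the idea of "quantifying the magnitudes of cues within salient regions" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a power spectrogram and a saliency map of the same shape are already available. The function names `salient_mask` and `cue_magnitudes`, the `top_fraction` threshold, and the two cues computed (mean energy and spectral centroid) are illustrative assumptions; the paper's actual cue set and region selection may differ.

```python
import numpy as np


def salient_mask(saliency, top_fraction=0.1):
    """Boolean mask of the most salient time-frequency bins.

    Hypothetical helper: keeps bins whose saliency is at or above the
    (1 - top_fraction) quantile of the whole map.
    """
    threshold = np.quantile(saliency, 1.0 - top_fraction)
    return saliency >= threshold


def cue_magnitudes(power_spec, freqs, mask):
    """Summarise two simple acoustic cues inside the masked region.

    power_spec: array of shape (n_freq, n_time), linear power
    freqs:      array of shape (n_freq,), bin centre frequencies in Hz
    mask:       boolean array, same shape as power_spec
    """
    eps = 1e-12  # guards against log(0) and division by zero
    masked = np.where(mask, power_spec, 0.0)
    total_power = masked.sum()
    n_bins = max(int(mask.sum()), 1)

    # Mean power of the salient bins, expressed in decibels.
    mean_energy_db = 10.0 * np.log10(total_power / n_bins + eps)
    # Power-weighted mean frequency of the salient bins.
    centroid_hz = (freqs[:, None] * masked).sum() / (total_power + eps)

    return {
        "mean_energy_db": float(mean_energy_db),
        "spectral_centroid_hz": float(centroid_hz),
    }


if __name__ == "__main__":
    # Toy input: flat spectrum, saliency concentrated in the 200 Hz band.
    freqs = np.array([100.0, 200.0, 300.0, 400.0])
    power_spec = np.ones((4, 5))
    saliency = np.zeros((4, 5))
    saliency[1, :] = 1.0

    mask = salient_mask(saliency, top_fraction=0.25)
    print(cue_magnitudes(power_spec, freqs, mask))
```

Values computed this way could then be compared against expert-referenced expectations for each emotion (for instance, whether a high-arousal prediction rests on a high-energy region), which is the kind of link between "what" is highlighted and "why" it matters that the abstract describes.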
