Beyond saliency: enhancing explanation of speech emotion recognition with expert-referenced acoustic cues

Seham Nasr, Zhao Ren, David Johnson

TL;DR

The paper tackles the limited faithfulness of saliency-based explanations in speech emotion recognition (SER) by grounding saliency in expert-referenced acoustic cues. It introduces a framework that quantifies cue magnitudes within salient regions and links them to theory-driven emotion markers, using Occlusion Sensitivity and Concept Relevance Propagation to produce more interpretable explanations. Experiments on CREMA-D and TESSSP2 demonstrate that salient regions carry psychoacoustically meaningful cues, with high-arousal emotions showing stronger, more variable markers and misclassifications revealing implausible cue alignments, indicating improved explanation reliability. This approach advances trustworthy, theory-aligned XAI for speech affective computing and can inform broader applications in speech event detection and pathology monitoring.

Abstract

Explainable AI (XAI) for Speech Emotion Recognition (SER) is critical for building transparent, trustworthy models. Current saliency-based methods, adapted from vision, highlight spectrogram regions but fail to show whether these regions correspond to meaningful acoustic markers of emotion, limiting faithfulness and interpretability. We propose a framework that overcomes these limitations by quantifying the magnitudes of cues within salient regions. This clarifies "what" is highlighted and connects it to "why" it matters, linking saliency to expert-referenced acoustic cues of speech emotions. Experiments on benchmark SER datasets show that our approach improves explanation quality by explicitly linking salient regions to theory-driven, expert-referenced acoustic cues of speech emotions. Compared to standard saliency methods, it provides more understandable and plausible explanations of SER models, offering a foundational step towards trustworthy speech-based affective computing.
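To make "quantifying the magnitudes of cues within salient regions" concrete, here is a minimal sketch in Python. It is not the authors' implementation. The function names, the `top_fraction` threshold, the frame-to-sample mapping (frame index times `hop_length`, with no padding or centering), and the two cues (RMS energy and zero-crossing rate) are illustrative assumptions; the cue set actually used in the paper is not listed in this summary.

```python
import numpy as np

def salient_segments(saliency, top_fraction=0.2):
    """Return (start_frame, end_frame) pairs, end exclusive, for the most salient frames.

    `saliency` is assumed to have shape (n_mels, n_frames).
    """
    frame_scores = np.abs(saliency).sum(axis=0)            # collapse the frequency axis
    threshold = np.quantile(frame_scores, 1.0 - top_fraction)
    mask = frame_scores >= threshold
    # Pad with False so every run of salient frames has a start and an end.
    padded = np.concatenate(([False], mask, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

def cue_magnitudes(waveform, segments, hop_length):
    """Mean RMS energy and zero-crossing rate over the salient waveform segments."""
    rms, zcr = [], []
    for start, end in segments:
        chunk = waveform[start * hop_length : end * hop_length]
        if chunk.size < 2:
            continue  # too short to measure anything
        rms.append(np.sqrt(np.mean(chunk ** 2)))
        zcr.append(np.mean(np.abs(np.diff(np.sign(chunk))) > 0))
    if not rms:
        return {}
    return {"rms_energy": float(np.mean(rms)),
            "zero_crossing_rate": float(np.mean(zcr))}

# Toy input: frames 3..5 are salient, and the waveform is only active there.
saliency = np.zeros((4, 10))
saliency[:, 3:6] = 1.0
waveform = np.zeros(40)
waveform[12:24] = 0.5

segments = salient_segments(saliency, top_fraction=0.3)
cues = cue_magnitudes(waveform, segments, hop_length=4)
print(segments, cues)
```

The resulting per-cue means could then be compared against reference ranges from the emotion literature, which is the "why" step the abstract describes; that comparison is left out here because the reference values are not given in this summary.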

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The proposed framework. Saliency maps are segmented (bottom), and top regions are projected onto the log-Mel spectrogram and waveform. The resulting temporal segments are analyzed for expert-referenced features, with "Mean" scores mapped to speech emotions.
  • Figure 2: Sample instance-level saliency maps from the XAI methods in our approach for the emotion "sad". (a) OS XAI, (b) CRP XAI, and (c) and (d) highlight the salient regions projected on the input log-Mel spectrogram.
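
The OS maps in Figure 2 come from Occlusion Sensitivity. As a rough illustration of the general technique, the sketch below masks one spectrogram patch at a time and records how much the target-class score drops. The patch size, stride, fill value, and the `predict_fn` interface are assumptions made for this example, and the toy classifier stands in for a real SER model; none of these reflect the paper's actual configuration.

```python
import numpy as np

def occlusion_sensitivity(spec, predict_fn, target, patch=(4, 4), stride=(4, 4), fill=None):
    """Saliency map where each cell holds the mean score drop caused by occluding it.

    `predict_fn` is a hypothetical callable mapping a (n_mels, n_frames) array
    to a vector of class probabilities.
    """
    fill = spec.min() if fill is None else fill
    base = predict_fn(spec)[target]
    heat = np.zeros(spec.shape, dtype=float)
    counts = np.zeros(spec.shape, dtype=float)
    n_f, n_t = spec.shape
    for f0 in range(0, n_f, stride[0]):
        for t0 in range(0, n_t, stride[1]):
            region = (slice(f0, f0 + patch[0]), slice(t0, t0 + patch[1]))
            occluded = spec.copy()
            occluded[region] = fill                 # hide this time-frequency patch
            drop = base - predict_fn(occluded)[target]
            heat[region] += drop
            counts[region] += 1
    return heat / np.maximum(counts, 1)             # average where patches overlap

def toy_predict(spec):
    """Stand-in model: class 1 fires on energy in the low-frequency band, frames 4..7."""
    p = 1.0 / (1.0 + np.exp(-spec[:4, 4:8].mean()))
    return np.array([1.0 - p, p])

spec = np.zeros((8, 12))
spec[:4, 4:8] = 2.0
heat = occlusion_sensitivity(spec, toy_predict, target=1)
print(np.round(heat, 3))
```

In this toy case only the patch covering the energetic region changes the prediction, so the map is non-zero there and zero elsewhere. A positive value means that occluding the cell lowered the target-class score.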