Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles
Rongchen Guo, Vincent Francoeur, Isar Nejadgholi, Sylvain Gagnon, Miodrag Bolic
TL;DR
This paper addresses the challenge of distinguishing stimulus-intended from speaker-evoked emotions in Speech Emotion Recognition (SER) by introducing a framework built on two semantic roles: descriptive semantics (the narrative content) and expressive semantics (the speaker's emotional stance). Using a dataset of 97 participants describing emotionally charged movie clips, the authors use Whisper for ASR and GPT-4o for semantic segmentation, then train discriminative and regression models on three tasks: intended emotion classification, evoked emotion classification, and valence/arousal regression. They show that descriptive semantics better predict intended emotions, while expressive semantics better capture evoked emotions and their dimensional ratings; audio-based baselines underperform by comparison. The approach offers interpretable, context-aware signals for SER and has implications for human-AI interaction. Its limitations are its reliance on movie-based elicitation and the subjectivity of self-reports, and real-world deployment raises ethical considerations.
Abstract
Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represent the contextual content of speech, and expressive semantics, which reflect the speaker's emotional state. Participants watched emotionally charged movie segments, and we then recorded audio clips of them describing their experiences, along with the intended emotion tag for each clip, participants' self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.
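
As a reading aid, the data described above and the three prediction tasks can be sketched as a small Python structure. This is a minimal illustration, not the authors' code: the `Recording` schema, its field names, and the `build_task_data` helper are all hypothetical, and the rating scale for valence and arousal is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Recording:
    """One participant's spoken description of one movie clip (hypothetical schema)."""
    descriptive_text: str   # segments narrating the clip's content
    expressive_text: str    # segments conveying the speaker's emotional stance
    intended_emotion: str   # emotion tag assigned to the clip
    evoked_emotion: str     # emotion the participant self-reported
    valence: float          # self-rated valence (scale is an assumption)
    arousal: float          # self-rated arousal (scale is an assumption)


def build_task_data(recordings: List[Recording], role: str) -> Dict[str, Tuple[list, list]]:
    """Pair the text of one semantic role with the targets of each task."""
    if role not in ("descriptive", "expressive"):
        raise ValueError("role must be 'descriptive' or 'expressive'")
    # Select the input text for the requested semantic role.
    texts = [getattr(r, f"{role}_text") for r in recordings]
    return {
        "intended_classification": (texts, [r.intended_emotion for r in recordings]),
        "evoked_classification": (texts, [r.evoked_emotion for r in recordings]),
        "valence_arousal_regression": (texts, [(r.valence, r.arousal) for r in recordings]),
    }
```

Calling `build_task_data(recordings, "descriptive")` and `build_task_data(recordings, "expressive")` on the same recordings yields matched inputs for comparing the two semantic roles on each task, which is the comparison the abstract reports.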
