Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles

Rongchen Guo, Vincent Francoeur, Isar Nejadgholi, Sylvain Gagnon, Miodrag Bolic

TL;DR

This paper tackles the challenge of distinguishing stimulus-intended from speaker-evoked emotions in Speech Emotion Recognition (SER) by introducing a framework with two semantic roles: descriptive semantics (content narrative) and expressive semantics (emotional stance). Using a dataset of 97 participants describing emotionally charged movie clips, the authors employ Whisper for ASR and GPT-4o for semantic segmentation, then train discriminative and regression models on three tasks: intended emotion classification, evoked emotion classification, and valence/arousal regression. They demonstrate that descriptive semantics better predict intended emotions, while expressive semantics better capture evoked emotions and their dimensional ratings, with audio-based baselines underperforming. The approach offers interpretable, context-aware signals for SER and has implications for human-AI interaction. Its limitations are its reliance on movie-based elicitation and on subjective self-reports, and real-world deployment raises ethical considerations.

Abstract

Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represents the contextual content of speech, and expressive semantics, which reflects the speaker's emotional state. Participants watched emotionally charged movie segments, and we recorded audio clips of them describing their experiences, along with the intended emotion tags for each clip, participants' self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.

Paper Structure

This paper contains 13 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Data Collection and Algorithm Workflow: Participants watched six videos eliciting specific emotions and provided speech descriptions, emotion ratings, and valence/arousal scores. Speech data were transcribed, segmented into descriptive and expressive semantics, and used to train models for three tasks: predicting intended emotions (TASK-1), evoked emotions (TASK-2), and valence/arousal (TASK-3).
  • Figure 2: Examples of participants' rated emotions. Each row represents a participant who watched six movie segments (six columns), one from each of the six emotional categories. The intended emotion tag associated with the video is shown as a yellow bar. Other rated emotions are colored blue. The height of the bars represents the emotion ratings from participants. For example, in the second movie clip watched by Participant P93, the intended emotion was "disgust," as shown by the yellow bar. After watching the clip, P93 reported experiencing four emotions: disgust, fear, sadness, and surprise, indicated by the blue bars. Among these, "disgust" was the strongest emotion, receiving the highest score of 6.
  • Figure 3: Valence and arousal ratings, colored by the intended emotion tags of movie segments.
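
The workflow in Figure 1 can be pictured as a small data-layout sketch. The Python below is only an illustration under stated assumptions, not the authors' implementation: the `Response` record, the `evoked_label` and `build_examples` helpers, and every value in `sample` other than the participant ID, the intended tag, and the disgust score of 6 are hypothetical. In particular, using the highest-rated emotion as the evoked label is an assumption made here for simplicity. Transcription (Whisper) and segmentation (GPT-4o) are assumed to have already produced the two text fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# A target is an emotion label (TASK-1, TASK-2) or a (valence, arousal) pair (TASK-3).
Target = Union[str, Tuple[float, float]]


@dataclass
class Response:
    """One participant's response to one movie segment (hypothetical layout)."""
    participant: str
    intended_emotion: str           # tag attached to the movie segment
    rated_emotions: Dict[str, int]  # participant's self-rated emotions
    valence: float
    arousal: float
    descriptive_text: str           # segment narrating the clip content
    expressive_text: str            # segment conveying the speaker's feelings


def evoked_label(resp: Response) -> str:
    # Assumption: the highest-rated emotion serves as the evoked-emotion label.
    return max(resp.rated_emotions, key=lambda e: resp.rated_emotions[e])


def build_examples(
    responses: List[Response], role: str, task: int
) -> List[Tuple[str, Target]]:
    """Pair one semantic role's text with the target of the chosen task."""
    if role not in ("descriptive", "expressive"):
        raise ValueError(f"unknown semantic role: {role}")
    examples: List[Tuple[str, Target]] = []
    for r in responses:
        text = r.descriptive_text if role == "descriptive" else r.expressive_text
        if task == 1:    # TASK-1: intended emotion classification
            target: Target = r.intended_emotion
        elif task == 2:  # TASK-2: evoked emotion classification
            target = evoked_label(r)
        elif task == 3:  # TASK-3: valence/arousal regression
            target = (r.valence, r.arousal)
        else:
            raise ValueError(f"unknown task: {task}")
        examples.append((text, target))
    return examples


# Illustrative record: texts, valence/arousal, and all ratings except
# the disgust score of 6 are invented for this sketch.
sample = Response(
    participant="P93",
    intended_emotion="disgust",
    rated_emotions={"disgust": 6, "fear": 3, "sadness": 2, "surprise": 4},
    valence=2.0,
    arousal=7.0,
    descriptive_text="A man opens a container and finds spoiled food.",
    expressive_text="I felt really grossed out watching that.",
)
```

Calling `build_examples([sample], "descriptive", 1)` and `build_examples([sample], "expressive", 1)` yields the same targets paired with different text, which is the kind of role-by-task comparison the paper reports.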