MSP-Conversation: A Corpus for Naturalistic, Time-Continuous Emotion Recognition

Luz Martinez-Lucas, Pravin Mote, Abinay Reddy Naini, Mohammed Abdelwahab, Carlos Busso

Abstract

Affective computing aims to understand and model human emotions for computational systems. Within this field, speech emotion recognition (SER) focuses on predicting emotions conveyed through speech. While early SER systems relied on limited datasets and traditional machine learning models, recent deep learning approaches demand large-scale, naturalistic emotional corpora. To address this need, we introduce the MSP-Conversation corpus: a dataset of more than 70 hours of conversational audio with time-continuous emotional annotations and detailed speaker diarizations. The time-continuous annotations capture the dynamic and context-dependent nature of emotional expression. The annotations in the corpus include fine-grained temporal traces of valence, arousal, and dominance. The audio data is sourced from publicly available podcasts and overlaps with a subset of the isolated speaking turns in the MSP-Podcast corpus to facilitate direct comparisons between annotation methods (i.e., in-context versus out-of-context annotations). The paper outlines the development of the corpus, the annotation methodology, analyses of the annotations, and baseline SER experiments, establishing the MSP-Conversation corpus as a valuable resource for advancing research in dynamic SER in naturalistic settings.

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Graphical user interface of CARMA during annotation of the MSP-Conversation corpus. The example shows a valence annotation of the MSP-Conversation_0227_2 conversation part.
  • Figure 2: Histograms of the attribute values of the mean emotional trace samples, showing the emotional content of the MSP-Conversation corpus.
  • Figure 3: Differences in inter-evaluator agreement between excluding and then including each annotator. A higher difference value indicates a more reliable rater. We use Cronbach's alpha to estimate the agreement between annotators.
  • Figure 4: Process of deriving sentence-level labels analogous to MSP-Podcast labels from the emotional traces in the MSP-Conversation corpus. The figure shows an example using the arousal traces of two workers (traces shown in green and orange). The process starts by obtaining the timing of the MSP-Podcast speaking turns that overlap with the current conversation. The trace values within that timing are aggregated for each worker using an aggregation function. The aggregated values are then averaged across workers to obtain the sentence-level labels.
  • Figure 5: Graphical user interface of the ELAN software during the speaker diarization of the MSP-Conversation_0021 conversation.
  • ...and 1 more figure
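
The label-derivation process described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the trace representation (a list of (time, value) pairs per worker), the function name sentence_label, and the choice of the mean as the default aggregation function are all hypothetical. The caption only says that the trace values inside the speaking-turn timing are aggregated and that the aggregated values are then averaged.

```python
from statistics import mean
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed representation: one trace per worker, as (time_in_seconds, value) pairs.
Trace = Sequence[Tuple[float, float]]


def sentence_label(
    traces: Sequence[Trace],
    start: float,
    end: float,
    aggregate: Callable[[List[float]], float] = mean,
) -> Optional[float]:
    """Derive one sentence-level label for the speaking turn [start, end].

    Step 1: for each worker, keep the trace values inside the turn timing.
    Step 2: aggregate those values per worker (mean by default, an assumption).
    Step 3: average the per-worker aggregates.
    Returns None if no worker has samples inside the turn.
    """
    per_worker = []
    for trace in traces:
        values = [value for time, value in trace if start <= time <= end]
        if values:  # skip workers with no samples in this window
            per_worker.append(aggregate(values))
    if not per_worker:
        return None
    return mean(per_worker)


# Toy arousal traces for two workers (made-up numbers, for illustration only).
worker_a = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
worker_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 3.0), (3.0, 5.0)]

label_mean = sentence_label([worker_a, worker_b], start=1.0, end=2.0)
label_max = sentence_label([worker_a, worker_b], start=1.0, end=2.0, aggregate=max)
```

Passing the aggregation function as a parameter mirrors the caption's open-ended wording: swapping mean for max, a median, or the last value in the window changes only step 2, while the averaging across workers stays the same.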