EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

Xingfeng Li, Xiaohan Shi, Junjie Li, Yongwei Li, Masashi Unoki, Tomoki Toda, Masato Akagi

TL;DR

The paper tackles speech emotion recognition (SER) in multilingual, real-world contexts where intra-utterance code-switching and mixed emotions are common, addressing the limitations of monolingual, single-label datasets. It introduces EM$^{2}$LDL, a multilingual speech corpus in English, Mandarin, and Cantonese, including code-switched speech, in which each utterance is annotated with a distribution over 32 emotion categories under the label distribution learning (LDL) paradigm. Data are collected from online platforms to ensure ecological validity, and each utterance is rated by 20 human raters to derive a probabilistic emotion distribution, enabling nuanced modeling of complex affective states. Baseline experiments with diverse self-supervised learning (SSL) speech models show that English-pretrained models generally perform best, yet no model fully captures the fine-grained distributions. These results highlight the importance of code-switch-aware and demographically inclusive SER approaches and establish EM$^{2}$LDL as a testbed for advancing affective computing in multilingual, culturally diverse settings.

Abstract

This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving the best results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline code are publicly available at https://github.com/xingfengli/EM2LDL.
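
To make the idea of a per-utterance emotion distribution concrete, the following is a minimal sketch of how rater votes could be turned into a label distribution. It assumes that each rater selects one or more emotion labels and that the distribution is the normalized vote count; the summary above does not specify the exact aggregation rule, so this is an illustrative assumption rather than the authors' procedure. The function name `label_distribution`, the four-category label set, and the five example raters are hypothetical stand-ins for the corpus's 32 categories and 20 raters.

```python
from collections import Counter


def label_distribution(ratings, categories):
    """Turn rater votes into a probability distribution over categories.

    ratings: list of lists, one inner list of chosen labels per rater.
    categories: ordered list of all allowed emotion labels.
    Assumption: every vote counts equally (simple normalized counts).
    """
    counts = Counter(label for rater in ratings for label in rater)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to normalize")
    # Categories that received no votes get probability 0.
    return {c: counts[c] / total for c in categories}


# Hypothetical toy input: 4 categories and 5 raters for readability.
categories = ["joy", "trust", "fear", "sadness"]
ratings = [["joy"], ["joy", "trust"], ["trust"], ["joy"], ["sadness"]]
dist = label_distribution(ratings, categories)
print(dist)
```

In this toy case "joy" receives three of the six votes, so it carries half of the probability mass, while "fear" receives none. A model trained under label distribution learning would be asked to predict the whole vector rather than only its most frequent label.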

Paper Structure

This paper contains 30 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of the multilingual data acquisition and annotation pipeline. Video segments containing multilingual and code-switched emotional speech are collected from online platforms (left). Audio tracks are extracted and segmented into utterances (center). The resulting speech samples are then annotated by human raters using a label distribution format (right).
  • Figure 2: Distribution of speech segments across online platforms and content categories.
  • Figure 3: Illustration of Plutchik’s emotion wheel used as the basis for mixed emotion annotation.
  • Figure 4: Label distribution examples computed from human ratings for two speech samples.
  • Figure 5: Frequency distribution of each single emotional category in the whole EM$^2$LDL corpus.
  • ...and 5 more figures