EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning
Xingfeng Li, Xiaohan Shi, Junjie Li, Yongwei Li, Masashi Unoki, Tomoki Toda, Masato Akagi
TL;DR
The paper addresses speech emotion recognition (SER) in multilingual, real-world settings where intra-utterance code-switching and mixed emotions are common and where monolingual, single-label datasets fall short. It introduces EM$^{2}$LDL, a multilingual speech corpus covering English, Mandarin, and Cantonese, including code-switched speech, in which each utterance is annotated with a distribution over 32 emotion categories under the label distribution learning (LDL) paradigm. The data are collected from online platforms for ecological validity, and 20 human raters per utterance are used to derive the probabilistic emotion distributions, which allows complex affective states to be modeled. Baseline experiments with a range of self-supervised learning (SSL) speech models show that English-pretrained models generally perform best, yet no model fully captures the fine-grained distributions. These results highlight the need for code-switch-aware and demographically inclusive SER approaches and position EM$^{2}$LDL as a testbed for affective computing in multilingual, culturally diverse settings.
Abstract
This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Existing emotion corpora are predominantly monolingual and single-label: they restrict linguistic diversity, cannot model mixed emotions, and lack ecological validity. To address these limitations, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions such as Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving the best results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline code are publicly available at https://github.com/xingfengli/EM2LDL.
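
To make the label distribution idea concrete, the following minimal Python sketch shows one way per-rater labels could be turned into an emotion distribution and compared with a model prediction. It is an illustration under stated assumptions, not the corpus's actual annotation pipeline: the summary does not specify how ratings are aggregated, so the sketch assumes each rater picks exactly one label, and the four-category label set, the function names `votes_to_distribution` and `kl_divergence`, and the choice of KL divergence as the comparison measure are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical label set: the corpus uses 32 categories; four are used here for brevity.
CATEGORIES = ["joy", "sadness", "anger", "surprise"]


def votes_to_distribution(votes, categories):
    """Turn per-rater labels into a probability distribution over categories.

    Assumption: every rater contributes exactly one label per utterance.
    """
    if not votes:
        raise ValueError("at least one rater vote is required")
    counts = Counter(votes)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(votes)
    # Relative frequency of each category, in the fixed category order.
    return [counts[c] / total for c in categories]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); eps guards against log(0) in the prediction."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0  # terms with zero target mass contribute nothing
    )


# 20 raters for one utterance (assumed single-label votes).
votes = ["joy"] * 12 + ["surprise"] * 6 + ["sadness"] * 2
target = votes_to_distribution(votes, CATEGORIES)

# A uniform prediction serves as a simple reference point.
uniform = [1.0 / len(CATEGORIES)] * len(CATEGORIES)
divergence = kl_divergence(target, uniform)
```

In this example the target puts 60% of its mass on joy, 30% on surprise, and 10% on sadness, so a mixed emotional state is represented directly rather than collapsed into a single majority label.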
