
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset

Rui Liu, Haolin Zuo, Zheng Lian, Xiaofen Xing, Björn W. Schuller, Haizhou Li

TL;DR

MC-EIU introduces a large-scale, open-source benchmark for joint emotion and intent understanding in multimodal conversations across English and Mandarin, featuring 7 emotion classes, 9 intent classes, and three modalities. It also provides the EI$^2$ network, a reference model that encodes multimodal dialogue history and applies a deep emotion–intent interaction mechanism, achieving state-of-the-art performance on both emotion and intent tasks in both languages. Through extensive baselines and ablations, the work demonstrates the importance of history modeling, cross-modal fusion, and interaction-focused architectures for reliable joint understanding. The dataset and reference model establish a foundation for reproducible research in affective computing and cross-lingual multimodal dialogue systems.

Abstract

Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to decode the semantic information manifested in a multimodal conversational history, while inferring the emotions and intents simultaneously for the current utterance. MC-EIU is an enabling technology for many human-computer interfaces. However, available datasets are lacking in terms of annotation, modality, language diversity, and accessibility. In this work, we propose an MC-EIU dataset, which features 7 emotion categories, 9 intent categories, 3 modalities (textual, acoustic, and visual content), and two languages (English and Mandarin). Furthermore, it is completely open-source and freely accessible. To our knowledge, MC-EIU is the first comprehensive and rich emotion and intent joint understanding dataset for multimodal conversation. Together with the release of the dataset, we also develop an Emotion and Intent Interaction (EI$^2$) network as a reference system by modeling the deep correlation between emotion and intent in the multimodal conversation. With comparative experiments and ablation studies, we demonstrate the effectiveness of the proposed EI$^2$ method on the MC-EIU dataset. The dataset and code will be made available at: https://github.com/MC-EIU/MC-EIU.

Paper Structure

This paper contains 45 sections, 10 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Visualization of the correlation between emotions and intents in the MC-EIU dataset. Each circle in the graph represents the sample count for a specific 'emotion-intent' pair.
  • Figure 2: Overview of the Emotion-Intent Interaction (EI$^2$) Network. Modules marked with a fire symbol require pre-training.
  • Figure 3: Sample audio and video files in the MC-EIU dataset. All files are named according to a consistent format.
  • Figure 4: Layout of the annotation platform.
  • Figure 5: The structure of the feature set.
  • ...and 3 more figures
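
The caption of Figure 1 describes each circle as the sample count for one 'emotion-intent' pair. As a minimal sketch of how such counts could be tallied, the snippet below counts co-occurring label pairs. The label names and the `samples` list are hypothetical placeholders for illustration only; they are not the dataset's actual categories or annotation format.

```python
from collections import Counter

# Hypothetical (emotion, intent) annotations, one tuple per utterance.
# These label names are placeholders, not the MC-EIU category names.
samples = [
    ("happy", "agreeing"),
    ("happy", "agreeing"),
    ("sad", "consoling"),
    ("happy", "questioning"),
]


def pair_counts(annotations):
    """Count how many samples carry each (emotion, intent) pair."""
    return Counter(annotations)


counts = pair_counts(samples)

# Print pairs from most to least frequent; each count would set the
# size of one circle in a plot like Figure 1.
for (emotion, intent), n in counts.most_common():
    print(f"{emotion} / {intent}: {n}")
```

A `Counter` keyed on the label tuple keeps the sketch independent of how many emotion or intent classes exist, so the same function would apply to the full 7 × 9 label grid.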