
MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models

Zebang Cheng, Fuqiang Niu, Yuxiang Lin, Zhi-Qi Cheng, Bowen Zhang, Xiaojiang Peng

TL;DR

The paper tackles emotion-cause pair extraction in multimodal conversations (MECPE) by proposing MER-MCE, a two-stage framework that separately handles multimodal emotion recognition (MER) and multimodal cause extraction (MCE). MER-MCE combines modality-specific encoders for text, audio, and visuals with an attention-based fusion, and employs a Multimodal Language Model to infer emotion causes from contextual cues. On the ECF dataset, it achieves a weighted F1 of $0.3435$, ranking third in SemEval-2024 Task 3 Subtask 2, demonstrating the value of integrating multiple modalities and generative reasoning for emotion-cause inference. The work highlights practical benefits for realistic dialogue systems and sets a foundation for further robustness and broader modality integration.
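The attention-based fusion mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scaled dot-product scoring, the shared feature dimension, and the toy vectors are all assumptions chosen for illustration. The idea is that each modality's feature vector gets a relevance score against a query vector, the scores are normalized with a softmax, and the fused representation is the weighted sum.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse per-modality feature vectors with scaled dot-product attention.

    features: dict mapping a modality name to a list of floats
              (hypothetical: all vectors share the query's dimension).
    query:    list of floats used to score each modality.
    Returns (fused_vector, weights_by_modality).
    """
    names = sorted(features)
    dim = len(query)
    # One relevance score per modality: dot(feature, query) / sqrt(dim).
    scores = [
        sum(f * q for f, q in zip(features[name], query)) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, computed per dimension.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy example with made-up 2-dimensional features.
toy_features = {
    "text": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "visual": [0.5, 0.5],
}
fused_vector, modality_weights = attention_fuse(toy_features, query=[1.0, 0.0])
```

In this toy setup the text vector aligns best with the query, so it receives the largest weight. A trained model would typically learn the query and project each encoder's output to a common dimension first.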

Abstract

This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveraging modality-specific features for enhanced emotion understanding and causality inference. Experimental evaluation demonstrates the advantages of our multimodal approach, with our submission achieving a competitive weighted F1 score of 0.3435, ranking third with a margin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team. Project: https://github.com/MIPS-COLT/MER-MCE.git
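To make the reported numbers concrete, the sketch below shows a generic support-weighted F1 over class labels and the scores implied by the stated margins. It is an illustration only and not the official SemEval scorer, which evaluates emotion-cause pairs; the function name and the toy labels are assumptions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (generic sketch)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += (count / total) * f1
    return score


# Toy labels (made up) to exercise the function.
toy_true = ["joy", "joy", "anger", "neutral"]
toy_pred = ["joy", "anger", "anger", "neutral"]
toy_score = weighted_f1(toy_true, toy_pred)

# Scores implied by the margins stated in the abstract.
ours = 0.3435
implied_first = round(ours + 0.0339, 4)   # first-place team
implied_second = round(ours + 0.0025, 4)  # second-place team
```

Under the stated margins, the first-place score works out to 0.3774 and the second-place score to 0.3460.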

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: An example of an annotated conversation from the ECF dataset. Dashed lines connect each emotion label to its corresponding cause utterance, illustrating the emotion-cause pairs present in the conversation. The image modality provides additional context and cues for understanding the expressed emotions.
  • Figure 2: The architecture of our proposed MER-MCE framework for multimodal emotion-cause pair extraction in conversations. The framework consists of two main stages: (a) Multimodal Emotion Recognition (MER), which utilizes specialized emotion encoders to extract modality-specific features from text, audio, and visual data, and (b) Multimodal Cause Extraction (MCE), which employs a Multimodal Language Model to integrate contextual information from the conversation and visual cues to identify the utterances that trigger the recognized emotions.
  • Figure 3: Prompt template for guiding the Multimodal LLM in sentiment analysis and emotion cause extraction from conversational data.
  • Figure 4: A line graph of scores across historical conversation windows.
  • Figure 5: The confusion matrix of the multimodal emotion recognition results.
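
The captions for Figures 3 and 4 refer to a prompt template and to historical conversation windows. A minimal sketch of how such a window might feed a cause-extraction prompt is shown below. The actual template is not reproduced here, so the prompt wording, the function names, the default window size, and the toy dialogue are all hypothetical.

```python
def history_window(utterances, target_index, window_size):
    """Return the target utterance plus up to window_size preceding ones."""
    start = max(0, target_index - window_size)
    return utterances[start:target_index + 1]


def build_cause_prompt(utterances, target_index, emotion, window_size=3):
    """Build a hypothetical text prompt asking which utterance caused an emotion.

    utterances: list of (speaker, text) tuples in conversation order.
    """
    start = max(0, target_index - window_size)
    context = history_window(utterances, target_index, window_size)
    # Number utterances by their 1-based position in the full conversation.
    lines = [
        f"{start + offset + 1}. {speaker}: {text}"
        for offset, (speaker, text) in enumerate(context)
    ]
    speaker, _ = utterances[target_index]
    question = (
        f"In utterance {target_index + 1}, {speaker} expresses {emotion}. "
        "Which utterance number above caused this emotion?"
    )
    return "\n".join(lines + [question])


# Toy dialogue (made up) for illustration.
toy_dialogue = [
    ("A", "I got the job."),
    ("B", "That is great news!"),
    ("A", "I start on Monday."),
    ("B", "We should celebrate."),
    ("A", "Thank you, I am so happy."),
]
toy_prompt = build_cause_prompt(toy_dialogue, target_index=4, emotion="joy", window_size=2)
```

Varying window_size in a setup like this is one way to study how much conversational history helps, which is the kind of relationship the Figure 4 caption describes.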