MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models

Zebang Cheng; Fuqiang Niu; Yuxiang Lin; Zhi-Qi Cheng; Bowen Zhang; Xiaojiang Peng

TL;DR

The paper tackles emotion-cause pair extraction in multimodal conversations (MECPE) by proposing MER-MCE, a two-stage framework that separately handles multimodal emotion recognition (MER) and multimodal cause extraction (MCE). MER-MCE combines modality-specific encoders for text, audio, and visuals with an attention-based fusion, and employs a Multimodal Language Model to infer emotion causes from contextual cues. On the ECF dataset, it achieves a weighted F1 of $0.3435$, ranking third in SemEval-2024 Task 3 Subtask 2, demonstrating the value of integrating multiple modalities and generative reasoning for emotion-cause inference. The work highlights practical benefits for realistic dialogue systems and sets a foundation for further robustness and broader modality integration.
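To make the phrase "attention-based fusion" concrete, the following is a minimal sketch of one common way to fuse per-modality feature vectors with scaled dot-product attention. It is an illustration only, not the MER-MCE implementation: the function name `attention_fuse`, the shared feature dimension, and the single `query` vector are all assumptions introduced here, and the summary above does not specify the fusion layer used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse modality vectors into one vector (hypothetical helper).

    features: dict mapping a modality name to a feature vector; all vectors
              are assumed to share the query's dimension.
    query:    vector used to score each modality (an assumption of this
              sketch; in a trained model it would be learned).
    Returns the fused vector and the attention weight per modality.
    """
    names = sorted(features)
    scale = math.sqrt(len(query))
    # Scaled dot-product score of each modality against the query.
    scores = [
        sum(f * q for f, q in zip(features[name], query)) / scale
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality vectors.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    toy = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [1.0, 1.0]}
    fused, weights = attention_fuse(toy, query=[1.0, 0.0])
    print(fused, weights)
```

In this toy setup the modalities whose features align with the query receive larger weights, so the fused vector leans toward them; a zero query gives every modality the same weight.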

Abstract

This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveraging modality-specific features for enhanced emotion understanding and causality inference. Experimental evaluation demonstrates the advantages of our multimodal approach, with our submission achieving a competitive weighted F1 score of 0.3435, ranking third with a margin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team. Project: https://github.com/MIPS-COLT/MER-MCE.git
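For readers unfamiliar with the reported metric, the sketch below shows the standard definition of a weighted F1 score: the per-class F1 averaged with weights proportional to each class's support in the gold labels. It is a generic label-level illustration under that assumption; the function name `weighted_f1` and the toy labels are hypothetical, and the task's official scorer is not reproduced here.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Support-weighted average of per-class F1 (generic definition)."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Each class contributes in proportion to its share of gold labels.
        score += (count / total) * f1
    return score


if __name__ == "__main__":
    gold = ["joy", "joy", "anger", "sadness"]
    pred = ["joy", "anger", "anger", "sadness"]
    print(weighted_f1(gold, pred))
```

Because the average is weighted by support, frequent classes dominate the score, which matters for datasets with imbalanced emotion categories.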

Paper Structure (16 sections, 2 equations, 5 figures, 3 tables)

Introduction
System Overview
Multimodal Emotion Recognition
Multimodal Cause Extraction
Experiments
Experimental Setup
Evaluation Metrics
Emotion Recognition Analysis
Cause Extraction Analysis
Error Analysis of the Entire System
Conclusion
Acknowledgements
Appendix
Experimental Data
Experimental Setup
...and 1 more sections

Figures (5)

Figure 1: An example of an annotated conversation from the ECF dataset. Dashed lines connect each emotion label to its corresponding cause utterance, illustrating the emotion-cause pairs present in the conversation. The image modality provides additional context and cues for understanding the expressed emotions.
Figure 2: The architecture of our proposed MER-MCE framework for multimodal emotion-cause pair extraction in conversations. The framework consists of two main stages: (a) Multimodal Emotion Recognition (MER), which utilizes specialized emotion encoders to extract modality-specific features from text, audio, and visual data, and (b) Multimodal Cause Extraction (MCE), which employs a Multimodal Language Model to integrate contextual information from the conversation and visual cues to identify the utterances that trigger the recognized emotions.
Figure 3: Prompt template for guiding the Multimodal LLM in sentiment analysis and emotion cause extraction from conversational data.
Figure 4: Line graph of scores across historical conversation window sizes.
Figure 5: Confusion matrix of the multimodal emotion recognition results.
