LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task

Suyash Vardhan Mathur, Akshett Rai Jindal, Hardik Mittal, Manish Shrivastava

TL;DR

This paper tackles multimodal emotion cause analysis in conversations by formulating MC-ECPE as a three-step task: emotion identification per utterance, candidate-cause detection, and emotion-cause pairing. It introduces three baselines with varying degrees of sequence context, built on separate encoders for each modality (text, audio, video), and compares BiLSTM and BiLSTM-CRF architectures. On the MC-ECPE dataset (derived from Friends), a BiLSTM-based baseline with emotion-tailored encoders (EmotionRoBERTa, WavLM, MViTv2) achieves the best leaderboard performance (weighted F1 0.1836, macro F1 0.1759), ranking 8th on the official leaderboard. The results suggest that short conversation lengths limit the benefit of longer context and that encoders trained on emotion tasks provide stronger signals; future work could explore joint multimodal embeddings and speaker-aware representations to further improve performance.
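
The three-step formulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Utterance` class, the `pair_emotions_with_causes` function, and in particular the pairing rule (a candidate cause must occur at or before the emotion utterance, within a fixed `window`) are assumptions made only for illustration. Steps 1 and 2 are represented by precomputed fields rather than by trained models.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    index: int      # 1-based position in the conversation
    emotion: str    # step 1 output: predicted emotion ("neutral" if none)
    is_cause: bool  # step 2 output: flagged as a candidate cause


def pair_emotions_with_causes(
    utterances: List[Utterance], window: int = 3
) -> List[Tuple[int, str, int]]:
    """Step 3 (hypothetical rule): pair each non-neutral utterance with every
    candidate cause that occurs at or before it, at most `window` turns back."""
    pairs = []
    for u in utterances:
        if u.emotion == "neutral":
            continue  # only emotional utterances receive causes
        for c in utterances:
            if c.is_cause and 0 <= u.index - c.index <= window:
                pairs.append((u.index, u.emotion, c.index))
    return pairs


# Toy conversation with made-up predictions from steps 1 and 2.
conversation = [
    Utterance(1, "neutral", True),
    Utterance(2, "joy", False),
    Utterance(3, "surprise", True),
    Utterance(4, "neutral", False),
]

for emotion_idx, emotion, cause_idx in pair_emotions_with_causes(conversation):
    print(f"utterance {emotion_idx} ({emotion}) <- caused by utterance {cause_idx}")
```

In this toy setup an utterance can be its own cause (utterance 3), and shrinking `window` removes the more distant pairs.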

Abstract

Conversation is the most natural form of human communication, where each utterance can range over a variety of possible emotions. While significant work has been done towards the detection of emotions in text, relatively little work has been done towards finding the cause of the said emotions, especially in multimodal settings. SemEval 2024 introduces the task of Multimodal Emotion Cause Analysis in Conversations, which aims to extract emotions reflected in individual utterances in a conversation involving multiple modalities (textual, audio, and visual modalities) along with the corresponding utterances that were the cause for the emotion. In this paper, we propose models that tackle this task as an utterance labeling and a sequence labeling problem and perform a comparative study of these models, involving baselines using different encoders, using BiLSTM for adding contextual information of the conversation, and finally adding a CRF layer to try to model the inter-dependencies between adjacent utterances more effectively. In the official leaderboard for the task, our architecture was ranked 8th, achieving an F1-score of 0.1759 on the leaderboard.
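
The TL;DR reports both a weighted and a macro F1, and the two differ only in how per-class scores are averaged. The sketch below shows that difference on made-up labels. It assumes plain per-utterance labels for simplicity; the official scorer for the task may operate on emotion-cause pairs and differ in detail, and the function names here are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """F1 for each label that appears in the gold annotations."""
    scores = {}
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Mean of per-class F1 weighted by each class's gold frequency."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)


# Made-up labels: "joy" is frequent, "anger" is rare.
gold = ["joy", "joy", "joy", "anger"]
pred = ["joy", "joy", "anger", "anger"]

print(per_class_f1(gold, pred))
print(f"macro F1:    {macro_f1(gold, pred):.4f}")
print(f"weighted F1: {weighted_f1(gold, pred):.4f}")
```

Because frequent classes dominate the weighted average, a system that does better on common emotions than on rare ones will tend to show a weighted F1 above its macro F1, which is the direction of the gap reported here.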

Paper Structure (11 sections, 2 figures, 1 table)

This paper contains 11 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Model Architecture
  • Figure 2: Emotion frequency in the dataset