JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models

Arefa; Mohammed Abbas Ansari; Chandni Saxena; Tanvir Ahmad

TL;DR

This work tackles multimodal emotion-cause analysis in conversations (ECAC) with a two-step framework that first identifies the emotion of each utterance and then extracts its causes, using the multimodal ECF (Emotion-Cause-in-Friends) dataset. It proposes two complementary routes: (i) instruction-tuned Llama-2 models, one for emotion recognition and one for cause prediction, and (ii) GPT-4V-based video captioning combined with GPT-3.5 in-context learning for emotion and cause extraction, including retrieval-augmented demonstrations. The GPT-driven approach yields the better results (rank 4 in SemEval-2024 Task 3). Ablations highlight the value of conversational context and self-causes, while video captions have a mixed impact. The study demonstrates practical, cost-aware strategies for using large language models to fuse text, audio, and video cues in emotion-cause analysis, with code made available on GitHub.

Abstract

This paper presents our system for SemEval-2024 Task 3: "The Competition of Multimodal Emotion Cause Analysis in Conversations". Effectively capturing emotions in human conversations requires integrating multiple modalities such as text, audio, and video. However, the complexity of these diverse modalities makes it challenging to develop an efficient multimodal emotion cause analysis (ECA) system. Our proposed approach addresses these challenges with a two-step framework, which we implement in two different ways. In Approach 1, we instruction-tune two separate Llama 2 models, one for emotion prediction and one for cause prediction. In Approach 2, we use GPT-4V for conversation-level video description and employ in-context learning with annotated conversations using GPT-3.5. Our system ranked 4th, and system ablation experiments demonstrate that our proposed solutions achieve significant performance gains. All the experimental code is available on GitHub.
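
As a rough illustration of the two-step idea with in-context demonstrations, the sketch below first asks a language model for the emotion of each utterance and then, only for non-neutral utterances, asks for the causing utterances. It is not the authors' implementation. Everything named here is an assumption made for illustration: the `llm` callable (any function mapping a prompt string to a reply string), the prompt wording, the seven-label set, and the bag-of-words cosine retriever, which stands in for whatever retrieval method the paper actually uses.

```python
from collections import Counter
from math import sqrt
from typing import Callable

# Assumed label set: six basic emotions plus neutral (seven categories).
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a simple stand-in retriever score)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def retrieve_demos(query: str, pool: list[dict], k: int = 2) -> list[dict]:
    """Pick the k annotated examples most similar to the query text."""
    return sorted(pool, key=lambda d: cosine(query, d["text"]), reverse=True)[:k]


def format_conversation(conv: list[dict]) -> str:
    """Number the utterances from 1 so the model can refer to them."""
    return "\n".join(f"{i + 1}. {u['speaker']}: {u['text']}" for i, u in enumerate(conv))


def emotion_prompt(conv: list[dict], idx: int, demos: list[dict]) -> str:
    """Step 1 prompt: few-shot demonstrations followed by the target utterance."""
    shots = "\n".join(f"Utterance: {d['text']}\nEmotion: {d['emotion']}" for d in demos)
    return (
        f"Label the emotion of one utterance as one of: {', '.join(EMOTIONS)}.\n"
        f"{shots}\n\nConversation:\n{format_conversation(conv)}\n"
        f"Utterance: {conv[idx]['text']}\nEmotion:"
    )


def cause_prompt(conv: list[dict], idx: int, emotion: str) -> str:
    """Step 2 prompt: ask which utterances cause the predicted emotion."""
    return (
        f"Conversation:\n{format_conversation(conv)}\n"
        f"Utterance {idx + 1} expresses {emotion}. "
        "List the numbers of the utterances that cause it, comma-separated:"
    )


def two_step(
    conv: list[dict], llm: Callable[[str], str], pool: list[dict], k: int = 2
) -> list[tuple[int, str, int]]:
    """Return (emotion utterance, emotion, cause utterance) triples, 1-indexed."""
    pairs = []
    for idx, utt in enumerate(conv):
        demos = retrieve_demos(utt["text"], pool, k)
        emotion = llm(emotion_prompt(conv, idx, demos)).strip().lower()
        if emotion not in EMOTIONS or emotion == "neutral":
            continue  # Step 2 runs only for recognised, non-neutral emotions.
        reply = llm(cause_prompt(conv, idx, emotion))
        causes = [int(t) for t in reply.replace(",", " ").split() if t.isdigit()]
        pairs += [(idx + 1, emotion, c) for c in causes if 1 <= c <= len(conv)]
    return pairs
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a wrapper around a hosted chat model or a local model can be supplied without changing the pipeline. A triple whose first and last numbers are equal corresponds to a self-cause, where the utterance expressing the emotion is also its cause.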

Paper Structure (36 sections, 6 equations, 20 figures, 8 tables)

Introduction
Background
Task definition
Related Work
Dataset
Class-distribution
Relative positions of emotion and causes
Methodology
Overview
Approach 1: Fine-tuned Llama-2
Emotion recognition
Cause prediction
Adding video captions
Approach 2: In-Context-Learning GPT
Video Captioning
...and 21 more sections

Figures (20)

Figure 1: Percentage of each of the seven emotion categories
Figure 2: Relative position of emotion and causes
Figure 3: Pipeline for fine-tuning Llama (Approach 1)
Figure 4: Pipeline of In-Context-Learning GPT Method (Approach 2)
Figure 5: Video Captioning Pipeline
...and 15 more figures
