LyS at SemEval-2024 Task 3: An Early Prototype for End-to-End Multimodal Emotion Linking as Graph-Based Parsing

Ana Ezquerro, David Vilares

TL;DR

This work tackles Multimodal Emotion Cause Analysis in Conversations by proposing an end-to-end graph-based parser that treats utterances as nodes and emotion-cause relations as labeled edges. A large multimodal encoder (text, image, and audio) contextualizes utterances, while a biaffine graph-based decoder predicts the adjacency structure and trigger spans, aided by a span-attention mechanism and speaker-relative encoding. The approach achieves a 7th-place finish in Subtask 1 (text-only) and provides post-evaluation insights for Subtask 2, highlighting the value of multimodal inputs—especially audio—and the importance of span prediction for learning. The results motivate lighter, distilled multimodal models and suggest practical paths for end-to-end emotion-cause analysis with graph-based parsing in real-world settings.

Abstract

This paper describes our participation in SemEval-2024 Task 3, which focused on Multimodal Emotion Cause Analysis in Conversations. We developed an early prototype for an end-to-end system that uses graph-based methods from dependency parsing to identify causal emotion relations in multi-party conversations. Our model comprises a neural transformer-based encoder for contextualizing multimodal conversation data and a graph-based decoder for generating the adjacency matrix scores of the causal graph. We ranked 7th out of 15 valid and official submissions for Subtask 1, using textual inputs only. We also discuss our participation in Subtask 2 during post-evaluation using multimodal inputs.
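
To make the decoder idea concrete, the following is a minimal sketch of biaffine scoring over utterance pairs, in the style of graph-based dependency parsers. It is not the authors' implementation: the helper names (`linear`, `biaffine_scores`, `predict_edges`), the dimensions, the random weights, and the zero threshold are all assumptions chosen for illustration, and plain linear maps stand in for the feed-forward projections and the trained encoder.

```python
import random


def linear(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def rand_matrix(rows, cols, rng):
    """Random weights; a trained model would learn these (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def biaffine_scores(utts, w_cause, w_effect, u):
    """Score every (cause i, effect j) utterance pair.

    Each utterance vector is projected into a 'cause' view c_i and an
    'effect' view e_j; a constant 1 is appended so the bilinear form also
    covers the linear and bias terms:
        scores[i][j] = [c_i; 1]^T  U  [e_j; 1]
    """
    causes = [linear(x, w_cause) + [1.0] for x in utts]
    effects = [linear(x, w_effect) + [1.0] for x in utts]
    scores = []
    for c in causes:
        cu = linear(c, u)  # row vector c^T U
        scores.append([sum(a * b for a, b in zip(cu, e)) for e in effects])
    return scores


def predict_edges(scores, threshold=0.0):
    """Keep every pair whose score exceeds the threshold as a cause -> effect edge."""
    return [(i, j) for i, row in enumerate(scores)
            for j, s in enumerate(row) if s > threshold]


rng = random.Random(0)
m, d, k = 4, 8, 5                      # utterances, embedding size, projection size
utts = rand_matrix(m, d, rng)          # stand-in for encoder utterance embeddings
w_cause = rand_matrix(d, k, rng)
w_effect = rand_matrix(d, k, rng)
u = rand_matrix(k + 1, k + 1, rng)     # biaffine weight matrix

scores = biaffine_scores(utts, w_cause, w_effect, u)   # m x m adjacency scores
edges = predict_edges(scores)
```

Here `scores` plays the role of the adjacency matrix scores mentioned in the abstract: entry `[i][j]` measures how strongly utterance `i` is predicted to cause the emotion in utterance `j`. Thresholding each entry independently allows an utterance to have several causes or none, which a tree-constrained parser would not permit.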

Paper Structure (23 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Example taken from the official website of SemEval-2024 Task 3 (https://nustm.github.io/SemEval-2024_ECAC/). The goal of the task is to predict (i) the emotion associated with each utterance in the conversation, (ii) the cause-effect relations between utterances that trigger the emotions, and (iii) the associated span in the cause utterance.
  • Figure 2: High-level architecture of our system. The encoder takes as input the sequence of $m$ utterances of a given conversation and returns a unique vector representation for each utterance. The decoder applies the affine attention product to the utterance embedding matrix to obtain the scores of the adjacency matrix, and returns the predicted sequence of emotions and the cause relations between utterances.
  • Figure 3: High-level representation of the textual encoder. The input (1) is the matrix of stacked token vectors of each utterance. The last hidden states of BERT are used as word embeddings (2) and the special CLS tokens are used as utterance embeddings (3). The effect embeddings (4), a partial representation from the decoder, are passed to the span module together with the contextualized BERT embeddings.
  • Figure 4: Graph-based decoder. The utterance embeddings (1) are projected to different representations (2, 3, 5, 6) using four feed-forward networks to flexibly represent utterance embeddings. The scores of the adjacency matrix and the probability tensor are computed with the affine attention product.
  • Figure 5: Span Attention module adapted from Vaswani et al. (2017). The tensor of word embeddings ($\mathbf{W}_1\cdots \mathbf{W}_m$) from the encoder (Figure 3) and the effect contextualizations ($\mathbf{E}$) from the decoder (Figure 4) are passed to the attention product using each $\mathbf{W}_i$ as key and $\mathbf{E}$ as query and value matrices.
  • ...and 1 more figures