LyS at SemEval-2024 Task 3: An Early Prototype for End-to-End Multimodal Emotion Linking as Graph-Based Parsing
Ana Ezquerro, David Vilares
TL;DR
This work tackles Multimodal Emotion Cause Analysis in Conversations with an end-to-end graph-based parser that treats utterances as nodes and emotion-cause relations as labeled edges. A large multimodal encoder (text, image, and audio) contextualizes the utterances, and a biaffine graph-based decoder predicts the adjacency structure and trigger spans, aided by a span-attention mechanism and speaker-relative encoding. The system placed 7th in Subtask 1 using text-only input, and post-evaluation experiments on Subtask 2 point to the value of multimodal inputs, especially audio, and to the importance of span prediction for learning. The results motivate lighter, distilled multimodal models and suggest practical paths toward end-to-end emotion-cause analysis with graph-based parsing in real-world settings.
Abstract
This paper describes our participation in SemEval-2024 Task 3, which focused on Multimodal Emotion Cause Analysis in Conversations. We developed an early prototype of an end-to-end system that uses graph-based methods from dependency parsing to identify causal emotion relations in multi-party conversations. Our model comprises a neural transformer-based encoder that contextualizes multimodal conversation data and a graph-based decoder that generates the adjacency matrix scores of the causal graph. We ranked 7th out of 15 valid and official submissions for Subtask 1, using textual inputs only. We also discuss our participation in Subtask 2 during post-evaluation, using multimodal inputs.
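To make the idea of a graph-based decoder that scores an adjacency matrix more concrete, the following is a minimal sketch of biaffine scoring in the style used in dependency parsing. It is an illustration under stated assumptions, not the system described here: the function names (`biaffine_scores`, `predict_edges`), the dimensions, the random stand-in for encoder outputs, the convention that rows are emotion utterances and columns are candidate cause utterances, and the zero threshold are all hypothetical.

```python
import numpy as np


def biaffine_scores(emotion_repr, cause_repr, U):
    """Score every (emotion utterance, cause utterance) pair.

    emotion_repr, cause_repr: arrays of shape (n, d).
    U: array of shape (d + 1, d + 1); the extra row and column act as
       bias terms once a constant 1 is appended to each vector.
    Returns an (n, n) matrix where entry [i, j] scores an edge between
    utterance i (emotion role) and utterance j (cause role).
    """
    ones = np.ones((emotion_repr.shape[0], 1))
    e = np.concatenate([emotion_repr, ones], axis=1)  # (n, d + 1)
    c = np.concatenate([cause_repr, ones], axis=1)    # (n, d + 1)
    return e @ U @ c.T                                # (n, n)


def predict_edges(scores, threshold=0.0):
    """Turn scores into a boolean adjacency matrix (assumed threshold)."""
    return scores > threshold


# Hypothetical stand-in for contextualized utterance embeddings.
rng = np.random.default_rng(0)
n_utterances, dim = 4, 8
utterances = rng.normal(size=(n_utterances, dim))

# Separate projections for the two roles an utterance can play.
W_emotion = rng.normal(size=(dim, dim))
W_cause = rng.normal(size=(dim, dim))
U = rng.normal(size=(dim + 1, dim + 1))

scores = biaffine_scores(utterances @ W_emotion, utterances @ W_cause, U)
adjacency = predict_edges(scores)
```

In a trained model the projections and `U` would be learned, and the encoder outputs would replace the random matrix; the sketch only shows how one matrix product yields a score for every utterance pair at once.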
