Evaluation of data inconsistency for multi-modal sentiment analysis
Yufei Wang, Mengyue Wu
TL;DR
This work tackles the challenge of semantic conflict across modalities in multi-modal sentiment analysis by introducing DiffEmo, a benchmark derived from CH-SIMS v2.0 that partitions data into Mixed, Conflicting, and Aligned sets to stress-test cross-modal understanding. It systematically evaluates traditional fusion-based approaches and multimodal large language models using prompting and in-context learning, revealing pronounced performance degradation under conflicting data and highlighting the current limitations of MLLMs in multimodal emotion reasoning. The study also investigates fusion strategies and reports that transformer-based architectures can maintain more stable performance, while simple fusion can be advantageous for non-aligned data. Overall, the paper motivates the need for richer cross-modal reasoning and larger-scale video emotion entailment data to advance robust multimodal sentiment analysis in the presence of modality conflicts.
Abstract
Emotion semantic inconsistency is a ubiquitous challenge in multi-modal sentiment analysis (MSA). MSA involves analyzing sentiment expressed across modalities such as text, audio, and video. Because human expression is subtle and nuanced, each modality may convey a distinct aspect of sentiment, and the resulting inconsistency may hinder the predictions of artificial agents. In this work, we introduce a modality-conflicting test set and assess the performance of both traditional multi-modal sentiment analysis models and multi-modal large language models (MLLMs). Our findings reveal significant performance degradation across traditional models when confronted with semantically conflicting data and point out the drawbacks of MLLMs in handling multi-modal emotion analysis. Our research presents a new challenge and offers valuable insights for the future development of sentiment analysis systems.
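
To make the idea of a modality-conflicting test set concrete, the following is a minimal sketch of one way such a split could be derived from per-modality sentiment annotations. It is not the authors' procedure: the criterion (a sample counts as conflicting when the polarities of its unimodal labels disagree), the treatment of the Mixed set as the full unpartitioned data, and the field names `text`, `audio`, and `vision` are all assumptions made for illustration.

```python
from typing import Dict, List


def polarity(score: float, eps: float = 1e-6) -> int:
    """Map a sentiment score to -1 (negative), 0 (neutral), or +1 (positive)."""
    if score > eps:
        return 1
    if score < -eps:
        return -1
    return 0


def partition(samples: List[dict]) -> Dict[str, List[dict]]:
    """Split samples into mixed / aligned / conflicting subsets.

    Assumption: each sample carries one sentiment score per modality under
    the hypothetical keys "text", "audio", and "vision".
    """
    # Assumption: "mixed" is simply the full, unpartitioned set.
    splits: Dict[str, List[dict]] = {
        "mixed": list(samples),
        "aligned": [],
        "conflicting": [],
    }
    for sample in samples:
        # Collect the distinct polarities expressed by the three modalities.
        signs = {polarity(sample[key]) for key in ("text", "audio", "vision")}
        # One distinct polarity means the modalities agree.
        subset = "aligned" if len(signs) == 1 else "conflicting"
        splits[subset].append(sample)
    return splits


# Toy data, invented for illustration only.
toy_samples = [
    {"id": 1, "text": 0.8, "audio": 0.6, "vision": 0.4},
    {"id": 2, "text": 0.6, "audio": -0.4, "vision": -0.2},
]
toy_splits = partition(toy_samples)
```

Under this assumed criterion, the first toy sample falls in the aligned subset and the second in the conflicting subset. A stricter or looser rule (for example, comparing each unimodal label against the multimodal label) would change the split sizes.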
