
Evaluation of data inconsistency for multi-modal sentiment analysis

Yufei Wang, Mengyue Wu

TL;DR

This work tackles the challenge of semantic conflict across modalities in multi-modal sentiment analysis by introducing DiffEmo, a benchmark derived from CH-SIMS v2.0 that partitions data into Mixed, Conflicting, and Aligned sets to stress-test cross-modal understanding. It systematically evaluates traditional fusion-based approaches and multimodal large language models using prompting and in-context learning, revealing pronounced performance degradation under conflicting data and highlighting the current limitations of MLLMs in multimodal emotion reasoning. The study also investigates fusion strategies and reports that transformer-based architectures can maintain more stable performance, while simple fusion can be advantageous for non-aligned data. Overall, the paper motivates the need for richer cross-modal reasoning and larger-scale video emotion entailment data to advance robust multimodal sentiment analysis in the presence of modality conflicts.

Abstract

Emotion semantic inconsistency is a ubiquitous challenge in multi-modal sentiment analysis (MSA). MSA involves analyzing sentiment expressed across modalities such as text, audio, and video. Because human expression is subtle and nuanced, each modality may convey a distinct aspect of sentiment, leading to inconsistency that may hinder the predictions of artificial agents. In this work, we introduce a modality-conflicting test set and assess the performance of both traditional multi-modal sentiment analysis models and multi-modal large language models (MLLMs). Our findings reveal significant performance degradation across traditional models when confronted with semantically conflicting data and point out the drawbacks of MLLMs when handling multi-modal emotion analysis. Our research presents a new challenge and offers valuable insights for the future development of sentiment analysis systems.
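
To make the idea of a modality-conflicting test set concrete, the sketch below splits samples into Mixed, Aligned, and Conflicting sets by comparing the polarity of each uni-modal label with the multi-modal label. It is only an illustration under stated assumptions, not the authors' construction procedure: the `Sample` fields, the sign-based `polarity` rule, the `eps` threshold, and the reading of "Mixed" as the full unpartitioned set are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical record with one multi-modal and three uni-modal sentiment scores."""
    sample_id: str
    m: float  # multi-modal label
    t: float  # textual label
    a: float  # acoustic label
    v: float  # visual label


def polarity(score: float, eps: float = 0.0) -> int:
    """Map a score to -1, 0, or +1 (assumed rule: sign with a dead zone of width eps)."""
    if score > eps:
        return 1
    if score < -eps:
        return -1
    return 0


def is_aligned(sample: Sample) -> bool:
    """Assumed criterion: every uni-modal polarity matches the multi-modal polarity."""
    target = polarity(sample.m)
    return all(polarity(x) == target for x in (sample.t, sample.a, sample.v))


def partition(samples: List[Sample]) -> Dict[str, List[Sample]]:
    """Split samples into aligned and conflicting sets; 'mixed' keeps all of them."""
    splits: Dict[str, List[Sample]] = {
        "mixed": list(samples),
        "aligned": [],
        "conflicting": [],
    }
    for sample in samples:
        key = "aligned" if is_aligned(sample) else "conflicting"
        splits[key].append(sample)
    return splits


# Toy scores, invented purely for illustration.
toy_samples = [
    Sample("a", m=0.8, t=0.6, a=0.4, v=1.0),     # all positive
    Sample("b", m=-0.4, t=0.2, a=-0.6, v=-0.8),  # text disagrees with the rest
    Sample("c", m=0.0, t=0.0, a=0.0, v=0.0),     # all neutral
]
toy_splits = partition(toy_samples)
```

With a split of this kind, a model can be scored separately on each subset, which is how a drop in performance on conflicting data becomes visible.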

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: An example of conflicting multi-modal data samples. "M", "V", "A", and "T" denote the multimodal, visual, acoustic, and textual labels, respectively.
  • Figure 2: Distribution of uni-modal and multi-modal labels for conflicting data.
  • Figure 3: Prompt and in-context learning for Video-LLaMA.