Multimodal Sentiment Analysis Based on Causal Reasoning

Fuhai Chen, Pengpeng Huang, Xuri Ge, Jie Huang, Zishuo Bao

TL;DR

This paper tackles modality bias in multimodal sentiment analysis by introducing CounterFactual Multimodal Sentiment Analysis (CF-MSA), which uses causal counterfactual reasoning to separate the direct effect of each modality from the joint multimodal signal. It formalizes a cause-effect model with a mediator and defines the total effect, natural direct effect, and total indirect effect ($TE$, $NDE$, $TIE$) to guide debiasing. CF-MSA is implemented as three branches (text, image, and text-image synthesis) that are fused by a learned function and optimized with a novel intermodal bias loss. Experiments on MVSA-Single and MVSA-Multiple show that CF-MSA reduces bias and reaches state-of-the-art performance under various bias-removal conditions, with ablations validating the new objective $\mathcal{L}_{ti}$ and the importance of non-uniform distributions for the learnable parameter $c$. The work provides a generalizable framework for debiased multimodal inference, and the authors state that they will release the code and datasets to support future research and practical sentiment analysis applications.
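For orientation, the three effects can be written in the standard mediation-analysis form. This is a sketch under assumed notation rather than the paper's own equations: $x$ denotes a treatment (an observed modality), $x^*$ its reference value (the modality masked out), $M$ the mediator, and $Y$ the predicted sentiment score.

$$
\begin{aligned}
TE  &= Y_{x,\,M_x} - Y_{x^*,\,M_{x^*}} \\
NDE &= Y_{x,\,M_{x^*}} - Y_{x^*,\,M_{x^*}} \\
TIE &= TE - NDE = Y_{x,\,M_x} - Y_{x,\,M_{x^*}}
\end{aligned}
$$

Under these definitions, subtracting $NDE$ from $TE$ removes the shortcut a single modality takes directly to the label, so that $TIE$ keeps only the effect that passes through the mediator.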

Abstract

With the rapid development of multimedia, the shift from unimodal textual sentiment analysis to multimodal image-text sentiment analysis has attracted academic and industrial attention in recent years. However, multimodal sentiment analysis is affected by unimodal data bias; for example, text sentiment can be misleading because of explicit sentiment semantics, which lowers the accuracy of the final sentiment classification. In this paper, we propose a novel CounterFactual Multimodal Sentiment Analysis framework (CF-MSA) that uses causal counterfactual inference to model multimodal sentiment prediction causally. CF-MSA mitigates the direct effect of unimodal bias and ensures heterogeneity across modalities by differentiating the treatment variables between modalities. In addition, considering the information complementarity and bias differences between modalities, we propose a new optimisation objective to effectively integrate different modalities and reduce the inherent bias of each modality. Experimental results on two public datasets, MVSA-Single and MVSA-Multiple, demonstrate that the proposed CF-MSA has superior debiasing capability and achieves new state-of-the-art performance. We will release the code and datasets to facilitate future research.

Paper Structure

This paper contains 21 sections, 27 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The upper part illustrates traditional likelihood-based multimodal sentiment prediction, where models rely on biased dominant modalities (e.g., explicit emotional words in text), which can make the prediction misleading. The lower part demonstrates our counterfactual reasoning approach, which mitigates such bias by analyzing the impact of missing modalities, leading to less biased multimodal predictions.
  • Figure 2: The analysis of the unimodal $TIE$ effect for image and text is shown on the left, where $T^*$ and $I^*$ indicate that text and image information is masked respectively, and the red crosses indicate that the path is blocked. In this case, the model relies only on unimodal information for sentiment prediction. The joint text-image effect ($TIE_{\text{joint}}$) is shown on the right.
  • Figure 3: CF-MSA model training consists of three main branches: the text branch ($Z_t$), the image branch ($Z_i$), and the text-image integration branch ($z_{k}$). The testing phase uses causal counterfactual inference to make unbiased sentiment label predictions.
  • Figure 4: Qualitative analysis of test set examples. The first probability distribution chart is the prediction result of the traditional model, the second is the prediction result after removing the text bias, the third is the prediction result after removing the image bias, and the fourth is the result after removing both image and text biases. Red indicates a label that was incorrectly predicted, while green indicates a label that was correctly predicted.
  • Figure 5: Cause-and-effect diagram examples and counterfactual situations.
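
The counterfactual inference step described in the captions of Figures 2 and 3 can be illustrated with a small sketch. It is not the authors' implementation. The fusion rule (log-sigmoid of summed branch logits), the scalar constant `c` standing in for masked branches, the function names, and the example logits are all assumptions chosen for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def fuse(z_t, z_i, z_k):
    """Hypothetical fusion: log-sigmoid of the summed branch logits, per class."""
    return [log_sigmoid(a + b + k) for a, b, k in zip(z_t, z_i, z_k)]


def remove_text_bias(z_t, z_i, z_k, c):
    """TIE-style score: factual prediction minus the text-only prediction.

    The image and text-image branches are masked by replacing their logits
    with the constant c (an assumed stand-in for a learned parameter).
    """
    masked = [c] * len(z_t)
    factual = fuse(z_t, z_i, z_k)
    text_only = fuse(z_t, masked, masked)
    return [f - d for f, d in zip(factual, text_only)]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Made-up logits for three classes (index 0, 1, 2).
z_t = [3.0, -1.0, 0.0]   # text branch strongly favours class 0
z_i = [-0.5, 1.0, 0.0]   # image branch favours class 1
z_k = [-0.5, 1.5, 0.0]   # text-image branch favours class 1

biased = fuse(z_t, z_i, z_k)
debiased = remove_text_bias(z_t, z_i, z_k, c=0.0)
print(argmax(biased), argmax(debiased))
```

In this toy case the fused prediction follows the dominant text branch, and subtracting the text-only term shifts the decision to the class supported by the image and text-image branches. Removing image bias would follow the same pattern with the roles of the text and image branches swapped.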