Trustworthy Multimodal Fusion for Sentiment Analysis in Ordinal Sentiment Space

Zhuyang Xie, Yan Yang, Jie Wang, Xiaorong Liu, Xiaofan Li

TL;DR

This work addresses the reliability challenges of multimodal sentiment analysis by introducing TMSON, a framework that explicitly models unimodal uncertainty distributions, fuses them through Bayesian fusion to form a robust multimodal representation, and enforces ordinal structure in the sentiment space via an ordinal regression loss. The method integrates unimodal feature extraction, uncertainty estimation, and probabilistic fusion with a multitask objective to learn both modality-specific and shared representations. Empirical results on CMU-MOSI, CMU-MOSEI, and SIMS demonstrate that TMSON achieves superior accuracy and robustness, particularly under noise and missing modalities, while providing interpretability through uncertainty and ordinal space visualizations. The approach advances trustworthy multimodal inference with practical impact on sentiment analysis tasks in noisy real-world settings.

Abstract

Multimodal video sentiment analysis aims to integrate information from multiple modalities to analyze the opinions and attitudes of speakers. Most previous work focuses on exploring intra- and inter-modality semantic interactions. However, these works ignore the reliability of the modalities themselves: individual modalities often contain noise or semantic ambiguity, or are missing altogether. In addition, previous multimodal approaches treat all modalities equally, largely ignoring their different contributions. Furthermore, existing multimodal sentiment analysis methods regress sentiment scores directly without considering the ordinal relationships among sentiment categories, which limits their performance. To address these problems, we propose a trustworthy multimodal sentiment ordinal network (TMSON) to improve performance in sentiment analysis. Specifically, we first devise a unimodal feature extractor for each modality to obtain modality-specific features. Then, a customized uncertainty distribution estimation network estimates the unimodal uncertainty distributions. Next, Bayesian fusion is performed on the learned unimodal distributions to obtain multimodal distributions for sentiment prediction. Finally, an ordinal-aware sentiment space is constructed, where ordinal regression is used to constrain the multimodal distributions. Our proposed TMSON outperforms baselines on multimodal sentiment analysis tasks, and empirical results demonstrate that TMSON is capable of reducing uncertainty to obtain more robust predictions.
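
To make the Bayesian fusion step more concrete, here is a minimal sketch of one common way to fuse per-modality distributions. It assumes each modality yields an independent one-dimensional Gaussian (a mean and a variance) and combines them with the standard precision-weighted product of Gaussians. The abstract does not state the distribution family or the exact fusion rule, so the Gaussian assumption and the name `fuse_gaussians` are illustrative only and may differ from what TMSON actually does.

```python
from typing import Sequence, Tuple


def fuse_gaussians(
    means: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Fuse independent 1-D Gaussians by a precision-weighted product.

    Assumption (not taken from the paper): every modality is summarized
    by a Gaussian N(mean, variance). Modalities with a larger variance
    (higher uncertainty) contribute less to the fused mean.
    """
    if len(means) != len(variances) or len(means) == 0:
        raise ValueError("means and variances must be non-empty and equal length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Precision is the inverse variance; precisions add under the product.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    fused_variance = 1.0 / total_precision
    fused_mean = fused_variance * sum(p * m for p, m in zip(precisions, means))
    return fused_mean, fused_variance


# Hypothetical text, visual, and audio estimates of a sentiment score.
mean, variance = fuse_gaussians([1.0, 0.0, 2.0], [0.5, 2.0, 1.0])
```

Under this assumption the fused variance is never larger than the smallest unimodal variance, which matches the intuition that combining modalities should reduce uncertainty, and a noisy modality (large variance) is down-weighted automatically rather than treated equally.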

Paper Structure

This paper contains 47 sections, 21 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of the differences between previous multimodal fusion methods and our trustworthy multimodal fusion method. (a) The aim of multimodal fusion is to integrate different modal information into a consistent representation for sentiment analysis without providing reliability for the prediction results. (b) Trustworthy multimodal fusion, in contrast, estimates the uncertainty of different modalities and uses uncertainty distribution fusion for more robust sentiment prediction. The numbers in brackets indicate uncertainty scores, with larger values indicating higher uncertainty.
  • Figure 2: The diagram of the TMSON framework. The input to the network is multimodal sequences (text, visual, and audio). For each modality, we observe unimodal representation (a) and estimate the uncertainty distribution (b), after which we fuse these distributions to obtain a consistent multimodal distribution (c). Ultimately, ordinal regression is introduced to constrain the fused multimodal distribution to be ordinal (d).
  • Figure 3: Triplet loss for regression. Blue circles represent anchors, and green circles represent reference points. (a) is the case where the samples are easily distinguishable, where the difference between $|y_a - y_{r}|$ and $|y_a - y_{h}|$ is much larger than $\xi$. (b) is the hard triplet, where the difference between $|y_a - y_{r}|$ and $|y_a - y_{h}|$ is smaller than $\xi$ and difficult to distinguish.
  • Figure 4: Weight analysis of different loss terms.
  • Figure 5: Capturing data uncertainty. The curves of different colors reflect the uncertainty distribution under different noise intensities.
  • ...and 4 more figures
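
The easy versus hard distinction in the Figure 3 caption can be sketched as a small check on label distances. The caption only says that a triplet is hard when the difference between $|y_a - y_{r}|$ and $|y_a - y_{h}|$ is smaller than $\xi$; taking that difference as an absolute value, and the function name `is_hard_triplet`, are assumptions made for illustration. This sketches the selection criterion only, not the paper's loss.

```python
def is_hard_triplet(y_a: float, y_r: float, y_h: float, xi: float) -> bool:
    """Return True when the two reference labels are nearly equidistant
    from the anchor label, i.e. the gap between |y_a - y_r| and
    |y_a - y_h| is smaller than the threshold xi.

    Assumption: the gap is measured as an absolute difference.
    """
    gap = abs(abs(y_a - y_r) - abs(y_a - y_h))
    return gap < xi


# Hypothetical sentiment labels: both references lie close to the anchor,
# so their label distances (0.2 and 0.3) are hard to tell apart.
hard = is_hard_triplet(y_a=1.0, y_r=1.2, y_h=0.7, xi=0.5)

# Here one reference is far from the anchor (distance 3.0 versus 0.2),
# so the triplet is easy to distinguish.
easy = is_hard_triplet(y_a=1.0, y_r=1.2, y_h=-2.0, xi=0.5)
```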