Senti-iFusion: An Integrity-centered Hierarchical Fusion Framework for Multimodal Sentiment Analysis under Uncertain Modality Missingness

Liling Li, Guoyang Xu, Xiongri Shen, Zhifei Xu, Yanbo Zhang, Zhiguo Zhang, Zhenxi Song

TL;DR

Senti-iFusion tackles multimodal sentiment analysis under unknown inter- and intra-modality missingness with an integrity-centered hierarchical fusion framework. It combines Integrity Estimation, Integrity-weighted Cross-modal Completion, and Integrity-guided Adaptive Fusion to recover and leverage sentiment cues from partially observed data. A dual-depth validation with semantic- and feature-level losses, together with a progressive two-stage training regime, yields state-of-the-art performance on MOSI and MOSEI under challenging missing-data patterns. The approach improves robustness and fine-grained sentiment understanding, with practical implications for real-world multimodal systems that face sensor unreliability and data loss.

Abstract

Multimodal Sentiment Analysis (MSA) is critical for human-computer interaction but faces challenges when modalities are incomplete or missing. Existing methods often assume pre-defined missing modalities or fixed missing rates, which limits their real-world applicability. To address this challenge, we propose Senti-iFusion, an integrity-centered hierarchical fusion framework capable of handling inter- and intra-modality missingness simultaneously. It comprises three hierarchical components: Integrity Estimation, Integrity-weighted Cross-modal Completion, and Integrity-guided Adaptive Fusion. First, the Integrity Estimation module predicts the completeness of each modality and mitigates the noise caused by incomplete data. Second, the Integrity-weighted Cross-modal Completion module employs a novel weighting mechanism to disentangle consistent semantic structures from modality-specific representations, enabling the precise recovery of sentiment-related features across language, acoustic, and visual modalities. A dual-depth validation with semantic- and feature-level losses keeps the reconstruction consistent at both fine-grained (low-level) and semantic (high-level) scales. Finally, the Integrity-guided Adaptive Fusion mechanism dynamically selects the dominant modality for attention-based fusion, so that the most reliable modality, judged by completeness and quality, contributes more to the final prediction. Senti-iFusion employs a progressive training approach to ensure stable convergence. Experimental results on popular MSA datasets demonstrate that Senti-iFusion outperforms existing methods, particularly in fine-grained sentiment analysis tasks. The code and our proposed Senti-iFusion model will be publicly available.
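
To make the idea of integrity-guided fusion concrete, the following is a minimal sketch and not the authors' architecture. It assumes each modality already has a fixed-length feature vector and a scalar integrity score in [0, 1]. The function name integrity_weighted_fusion, the softmax weighting, and the temperature parameter are illustrative assumptions; the paper itself describes attention-based fusion, which this sketch replaces with a plain weighted sum.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def integrity_weighted_fusion(
    features: Dict[str, List[float]],
    integrity: Dict[str, float],
    temperature: float = 0.1,  # assumed hyperparameter, not from the paper
) -> Tuple[str, Dict[str, float], List[float]]:
    """Pick the most complete modality and fuse features by integrity.

    Returns the dominant modality name, the per-modality fusion weights,
    and the fused feature vector (a weighted sum of the inputs).
    """
    names = sorted(features)
    # The dominant modality is the one with the highest integrity score.
    dominant = max(names, key=lambda name: integrity[name])
    # Higher integrity gives a larger share of the fused representation.
    weights = softmax([integrity[name] for name in names], temperature)

    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return dominant, dict(zip(names, weights)), fused
```

Calling the function with integrity scores such as language 0.9, acoustic 0.3, and visual 0.1 selects language as the dominant modality and gives it the largest weight; a lower temperature makes the weighting more sharply peaked on that modality.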

Paper Structure

This paper contains 34 sections, 22 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of a wrong prediction under a dual-level missingness scenario, where missing data across multiple modalities leads to incorrect sentiment classification.
  • Figure 2: Framework Overview. Senti-iFusion consists of three key components: Integrity Estimation (IE), Integrity-weighted Cross-modal Completion (IC), and Integrity-guided Adaptive Fusion (IF). Given multimodal inputs with unknown and mixed missingness, the model first estimates the integrity scores of each modality within the mini-batch. The predicted scores guide the integrity-weighted cross-modal feature completion, using dual-depth validation and a dual-level loss (semantic and fine-grained). Finally, the integrity-guided attention mechanism adaptively cooperates with the dominant modality and performs multi-scale fusion for sentiment prediction.
  • Figure 3: Sentiment score distributions of the MOSI dataset across the training, validation, and test splits, illustrating label imbalance and distribution shifts between splits.
  • Figure 4: Sentiment score distributions of the MOSEI dataset across the training, validation, and test splits, highlighting the pronounced class imbalance in real-world sentiment data.
  • Figure 5: Performance curves for a range of sample-level drop_rate values simulating progressively higher levels of unknown inter-modality missingness. Panels (a) to (c) show the corresponding MAE, ACC-5, and Non0-F1 scores on MOSI, respectively.
  • ...and 2 more figures
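
The sample-level drop_rate in Figure 5 simulates inter-modality missingness. The sketch below shows one way such a corruption step could be written; it is an illustration under stated assumptions, not the authors' evaluation protocol. The function name drop_modalities, the independent per-modality coin flip, the zero-filling of dropped modalities, and the rule that at least one modality always survives are all assumptions.

```python
import random
from typing import Dict, List

Frames = List[List[float]]  # time steps x feature dimension


def drop_modalities(
    sample: Dict[str, Frames],
    drop_rate: float,
    rng: random.Random,
) -> Dict[str, Frames]:
    """Zero out whole modalities of one sample with probability drop_rate.

    Each modality is dropped independently; if every modality would be
    dropped, one is chosen at random and kept (assumed rule).
    """
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must lie in [0, 1]")

    names = sorted(sample)
    dropped = [name for name in names if rng.random() < drop_rate]
    if len(dropped) == len(names):
        # Keep at least one observed modality per sample.
        dropped.remove(rng.choice(names))

    corrupted: Dict[str, Frames] = {}
    for name in names:
        if name in dropped:
            # A missing modality is represented by all-zero frames.
            corrupted[name] = [[0.0] * len(step) for step in sample[name]]
        else:
            corrupted[name] = [list(step) for step in sample[name]]
    return corrupted
```

Sweeping drop_rate from 0 to 1 with a seeded random.Random gives a reproducible series of progressively more incomplete inputs, which is the kind of setting the performance curves describe.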