Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis
Xincheng Wang, Liejun Wang, Yinfeng Yu, Xinxin Jiao
TL;DR
This work tackles missing data in multimodal sentiment analysis by introducing MITR-DNet, which unifies modality-invariant bidirectional temporal representation learning (MIB-TRL) with a Transformer Fusion strategy and a distillation framework from a complete-modality teacher to a missing-modality student. By jointly addressing reconstruction and representation learning, the approach mitigates inter-modal heterogeneity while preserving informative signals when modalities are absent. Empirical results on CMU-MOSI and CH-SIMS show strong performance gains in both complete and incomplete modality settings, with ablations validating the necessity of the TF module and the distillation-reconstruction-SimSiam loss trio. The proposed framework offers practical resilience for real-world MSA under random modality dropouts and heterogeneous data, advancing robustness in multimodal sentiment prediction.
Abstract
Multimodal Sentiment Analysis (MSA) integrates diverse modalities (text, audio, and video) to comprehensively analyze and understand individuals' emotional states. However, incomplete data is prevalent in real-world settings and poses significant challenges to MSA, mainly because which modalities are missing varies randomly. Moreover, the heterogeneity of multimodal data has yet to be effectively addressed. To tackle these challenges, we introduce the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet) for Missing Multimodal Sentiment Analysis. MITR-DNet employs a distillation approach in which a complete-modality teacher model guides a missing-modality student model, ensuring robustness when modalities are missing. In addition, we develop the Modality-Invariant Bidirectional Temporal Representation Learning Module (MIB-TRL) to mitigate heterogeneity.
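
The teacher-student setup described above can be illustrated with a minimal sketch. It is not the MITR-DNet implementation: the linear toy encoders, the mean-squared-error distillation loss, the zero-filling of missing modalities, and every name in it (`drop_modalities`, `encode`, `distillation_loss`, `p_missing`) are assumptions made for illustration. It only shows the general pattern of a teacher that sees all modalities guiding a student that sees a randomly masked input.

```python
import numpy as np

MODALITIES = ("text", "audio", "video")


def drop_modalities(features, rng, p_missing=0.5):
    """Randomly mark modalities as missing and zero-fill them.

    Assumption: at least one modality is always kept, and a missing
    modality is represented by an all-zero feature sequence.
    """
    keep = {m: bool(rng.random() >= p_missing) for m in features}
    if not any(keep.values()):
        names = list(features)
        keep[names[rng.integers(len(names))]] = True
    masked = {m: x if keep[m] else np.zeros_like(x) for m, x in features.items()}
    return masked, keep


def encode(features, weights):
    """Toy encoder: average each modality over time, project it to a
    shared dimension, then average the projections across modalities."""
    pooled = [features[m].mean(axis=0) @ weights[m] for m in MODALITIES]
    return np.mean(pooled, axis=0)


def distillation_loss(teacher_repr, student_repr):
    """Mean-squared error between teacher and student representations
    (an assumed choice of distillation objective)."""
    return float(np.mean((teacher_repr - student_repr) ** 2))


rng = np.random.default_rng(0)
dims = {"text": 8, "audio": 5, "video": 6}  # hypothetical feature sizes

# One utterance: 10 time steps per modality.
features = {m: rng.normal(size=(10, d)) for m, d in dims.items()}

# Separate (hypothetical) parameters for the teacher and the student.
teacher_weights = {m: rng.normal(size=(d, 4)) for m, d in dims.items()}
student_weights = {m: rng.normal(size=(d, 4)) for m, d in dims.items()}

teacher_repr = encode(features, teacher_weights)   # complete modalities
masked, keep = drop_modalities(features, rng)      # random modality dropout
student_repr = encode(masked, student_weights)     # incomplete modalities

loss = distillation_loss(teacher_repr, student_repr)
print(keep, loss)
```

In a training loop, the student's parameters would be updated to reduce this loss, typically alongside a task loss, while the teacher stays fixed.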
