Learning in Order! A Sequential Strategy to Learn Invariant Features for Multimodal Sentiment Analysis
Xianbing Zhao, Lizhen Qu, Tao Feng, Jianfei Cai, Buzhou Tang
TL;DR
This work tackles domain generalization (DG) in multimodal sentiment analysis under distribution shift, where target-domain data is unavailable during training. It introduces the $S^2$LIF framework, a sequential strategy that first learns domain-invariant textual features $x_t^c$ via learnable masks, then derives domain-invariant video features $x_v^c$ conditioned on $x_t^c$. The two stages are optimized through the losses $\mathcal{L}_t$ and $\mathcal{L}_v$, which include a sparsity regularizer, giving the overall objective $\mathcal{L} = \mathcal{L}_t + \mathcal{L}_v$. Empirically, on CMU-MOSI, CMU-MOSEI, and MELD, $S^2$LIF outperforms strong baselines in DG in both single-source and multi-source settings, and the learned features are sparse, independent across modalities, and strongly correlated with sentiment labels. Analyses of feature existence and of cross-modal and intra-modal correlations, together with ablations and case studies, support the effectiveness and interpretability of the sequential masking approach. The authors plan to release the code publicly to facilitate replication and extension.
Abstract
This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained on either a single source domain or multiple source domains using our learning strategy. The strategy starts by learning domain-invariant features from text, then learns sparse domain-agnostic features from videos, assisted by the features selected from text. Our experimental results demonstrate that, on average, our model achieves significantly better performance than state-of-the-art approaches in both single-source and multi-source settings. Our feature selection procedure favors features that are independent of each other and strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be made publicly available upon acceptance.
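The sequential strategy can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid gating, the linear prediction heads, the squared-error task loss, the concatenation used to condition the video stage on $x_t^c$, and all names (`apply_mask`, `sequential_loss`, `params`, `lam`) are assumptions made only for illustration. The sketch computes the forward pass of $\mathcal{L} = \mathcal{L}_t + \mathcal{L}_v$ for one example; training the masks by gradient descent is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_mask(features, mask_logits):
    # Soft feature selection: each dimension is scaled by a gate in (0, 1).
    return [sigmoid(m) * x for m, x in zip(mask_logits, features)]

def sparsity(mask_logits):
    # L1 penalty on the gates; gates are positive, so this is their sum.
    return sum(sigmoid(m) for m in mask_logits)

def linear_head(features, weights):
    # Hypothetical linear sentiment predictor.
    return sum(w * x for w, x in zip(weights, features))

def sequential_loss(x_t, x_v, label, params, lam=0.1):
    # Stage 1: select text features x_t^c and score them alone (L_t).
    x_t_c = apply_mask(x_t, params["text_mask"])
    pred_t = linear_head(x_t_c, params["text_head"])
    loss_t = (pred_t - label) ** 2 + lam * sparsity(params["text_mask"])

    # Stage 2: select video features x_v^c; the predictor also sees x_t^c,
    # which is how this sketch models "conditioned on x_t^c" (L_v).
    x_v_c = apply_mask(x_v, params["video_mask"])
    pred_v = linear_head(x_t_c + x_v_c, params["fusion_head"])
    loss_v = (pred_v - label) ** 2 + lam * sparsity(params["video_mask"])

    return loss_t + loss_v, x_t_c, x_v_c

# Toy example with made-up numbers.
params = {
    "text_mask": [0.0, 0.0],          # gate = 0.5 for each text dimension
    "video_mask": [0.0],              # gate = 0.5 for the video dimension
    "text_head": [1.0, 1.0],
    "fusion_head": [1.0, 1.0, 0.0],   # weights over [x_t^c, x_v^c]
}
total, x_t_c, x_v_c = sequential_loss([1.0, 2.0], [2.0], 1.5, params)
```

In a training loop following the order described above, the text mask would presumably be fitted first and held fixed while the video mask is learned; the sketch only shows how the two loss terms combine.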
