Learning in Order! A Sequential Strategy to Learn Invariant Features for Multimodal Sentiment Analysis

Xianbing Zhao; Lizhen Qu; Tao Feng; Jianfei Cai; Buzhou Tang

TL;DR

This work tackles domain generalization (DG) in multimodal sentiment analysis under distribution shifts where target-domain data is unavailable. It introduces the S^2LIF framework, a sequential strategy that first learns domain-invariant textual features $x_t^c$ via learnable masks, then derives domain-invariant video features $x_v^c$ conditioned on $x_t^c$. The two stages are optimized through losses $\mathcal{L}_t$ and $\mathcal{L}_v$ with sparsity regularization, giving the overall objective $\mathcal{L} = \mathcal{L}_t + \mathcal{L}_v$. Empirically, on CMU-MOSI, CMU-MOSEI, and MELD, S^2LIF achieves superior DG performance versus strong baselines in both single-source and multi-source settings, with the learned features displaying sparsity, cross-modal independence, and strong correlation with sentiment labels. Analyses including feature existence, cross-/intra-modal correlations, ablations, and case studies support the effectiveness and interpretability of the sequential masking approach, and the authors plan to release code publicly to facilitate replication and extension.

Abstract

This work proposes a novel and simple sequential learning strategy to train models on videos and texts for multimodal sentiment analysis. To estimate sentiment polarities on unseen out-of-distribution data, we introduce a multimodal model that is trained on either a single source domain or multiple source domains using our learning strategy. This strategy starts by learning domain-invariant features from text, followed by learning sparse domain-agnostic features from videos, assisted by the selected features learned from text. Our experimental results demonstrate that, on average, our model achieves significantly better performance than state-of-the-art approaches in both single-source and multi-source settings. Our feature selection procedure favors features that are independent of each other and strongly correlated with their polarity labels. To facilitate research on this topic, the source code of this work will be made publicly available upon acceptance.
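The sequential strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid-gated masks, the linear predictors, the squared-error task loss, the L1-style sparsity penalty, and every name used here (`apply_mask`, `sequential_loss`, `params`, `lam`) are assumptions chosen for illustration. Only the two-stage ordering (text first, then video conditioned on the selected text features) and the additive objective $\mathcal{L} = \mathcal{L}_t + \mathcal{L}_v$ come from the summary.

```python
import math


def sigmoid(z):
    """Logistic function, used here as a soft gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_mask(features, mask_logits):
    """Gate each feature with a soft learnable mask (assumed form)."""
    return [f * sigmoid(m) for f, m in zip(features, mask_logits)]


def sparsity_penalty(mask_logits):
    """Sum of gate values; penalising it pushes gates toward zero."""
    return sum(sigmoid(m) for m in mask_logits)


def linear(features, weights):
    """Dot product standing in for a sentiment predictor."""
    return sum(f * w for f, w in zip(features, weights))


def sequential_loss(x_t, x_v, y, params, lam=0.01):
    """Return (L, L_t, L_v) for one example, with L = L_t + L_v.

    Stage 1 selects text features; stage 2 selects video features
    on top of the already selected text features. In actual training
    the text-side parameters would typically be held fixed in stage 2.
    """
    # Stage 1: masked text features x_t^c and the text loss L_t.
    x_t_c = apply_mask(x_t, params["text_mask"])
    pred_t = linear(x_t_c, params["w_t"])
    loss_t = (pred_t - y) ** 2 + lam * sparsity_penalty(params["text_mask"])

    # Stage 2: masked video features x_v^c, conditioned on x_t^c.
    x_v_c = apply_mask(x_v, params["video_mask"])
    pred_v = pred_t + linear(x_v_c, params["w_v"])
    loss_v = (pred_v - y) ** 2 + lam * sparsity_penalty(params["video_mask"])

    return loss_t + loss_v, loss_t, loss_v


# Hypothetical toy inputs: two text features, one video feature.
example_params = {
    "text_mask": [0.0, 0.0],
    "w_t": [1.0, 1.0],
    "video_mask": [0.0],
    "w_v": [2.0],
}
total, l_t, l_v = sequential_loss([1.0, 1.0], [1.0], 1.0, example_params, lam=0.1)
```

Keeping the two losses separate makes it possible to optimize them in order rather than jointly, which is the point of the sequential strategy; a joint variant would update both masks against a single prediction.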

Paper Structure (30 sections, 20 equations, 9 figures, 3 tables)

Introduction
Related Work
Multimodal Sentiment Analysis
Domain Generalization
Causal Representation Learning
Method
Problem Statement
Model Overview
Keyframe-aware Masking
Sequential Multimodal Learning
Multimodal Learnable Masks
Learning Objective
Experiments
Datasets
Implementation Detail
...and 15 more sections

Figures (9)

Figure 1: Classifiers employ learnable masks to identify domain-invariant text features first, conditioned on which the classifiers learn domain-invariant features from videos.
Figure 2: An overview of our proposed framework.
Figure 3: (a) The causal structure of the data generation process involves direct causal effects from $x_1$ and $x_2$ to $Y$. There exists a causal relationship between $x_2$ and $x_3$. $\epsilon$ represents independent noise. The latent variable $U$ serves as a confounder for $x_1$ and $x_3$. (b) The structure after severing the edge between $x_2$ and $x_3$, which eliminates their causal relationship.
Figure 4: Visualization of domain-invariant features across domains.
Figure 5: The proportion of domain-invariant features.
...and 4 more figures
