Toward Robust Multimodal Learning using Multimodal Foundational Models

Xianbing Zhao, Soujanya Poria, Xuejiao Li, Yixin Chen, Buzhou Tang

TL;DR

This work tackles robustness in multimodal sentiment analysis under incomplete data using TRML, a framework that extends CLIP-based multimodal foundational models by generating virtual modalities for missing data and aligning semantic spaces through a semantic-matching objective. TRML comprises a Missing Modality Inference module to synthesize virtual visual/text modalities and a Semantic Matching Learning module to align the semantics between original and generated modalities, trained with a combined loss $\mathcal{L} = \mathcal{L}_{task} + \alpha \mathcal{L}_{sml}$ with hyperparameters $\alpha$ and $\tau$. Empirical results on CMU-MOSI, CMU-MOSEI, and MELD show that TRML outperforms prior methods, approaches the upper bound when certain modalities are missing, and remains robust across settings where missingness occurs during training and testing. The results indicate that leveraging latent cross-modal semantic correlations in foundational models, together with targeted inference and alignment modules, can substantially improve robustness in real-world multimodal tasks. Overall, TRML offers a scalable and effective path to robust multimodal learning with incomplete data, with potential extensions to additional modalities and domains.

Abstract

Existing multimodal sentiment analysis tasks rely heavily on the assumption that the training and test sets contain complete multimodal data, yet this assumption often fails to hold: multimodal data are frequently incomplete in real-world scenarios. Therefore, a multimodal model that is robust to randomly missing modalities is highly desirable. Recently, CLIP-based multimodal foundational models have demonstrated impressive performance on numerous multimodal tasks by learning the aligned cross-modal semantics of image and text pairs, but these models also cannot directly handle scenarios in which a modality is absent. To alleviate this issue, we propose a simple and effective framework, namely TRML, Toward Robust Multimodal Learning using Multimodal Foundational Models. TRML employs generated virtual modalities to replace missing modalities, and aligns the semantic spaces between the generated and missing modalities. Concretely, we design a missing modality inference module to generate virtual modalities and replace missing modalities. We also design a semantic matching learning module to align the semantic spaces of the generated and missing modalities. Under the prompt of the complete modality, our model captures the semantics of missing modalities by leveraging the aligned cross-modal semantic space. Experiments demonstrate the superiority of our approach on three multimodal sentiment analysis benchmark datasets: CMU-MOSI, CMU-MOSEI, and MELD.
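To make the alignment idea concrete, the sketch below shows one way a semantic matching objective could be combined with a task loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the objective, so a one-directional, CLIP-style contrastive (InfoNCE) loss over cosine similarities is assumed here. The function names, the temperature `tau`, the weight `alpha`, and their default values are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity_matrix(virtual, original):
    """Entry [i][j] is the cosine similarity of virtual[i] and original[j]."""
    v = [l2_normalize(row) for row in virtual]
    o = [l2_normalize(row) for row in original]
    return [[sum(a * b for a, b in zip(vi, oj)) for oj in o] for vi in v]

def semantic_matching_loss(virtual, original, tau=0.07):
    """Assumed InfoNCE-style loss: each virtual embedding should be most
    similar to the original embedding of the same sample (the diagonal)."""
    sim = cosine_similarity_matrix(virtual, original)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim)

def total_loss(task_loss, sml_loss, alpha=0.1):
    """Assumed combination: task loss plus a weighted semantic matching term."""
    return task_loss + alpha * sml_loss

# Toy embeddings: the virtual vectors either match or are swapped
# relative to the original vectors of the same samples.
original = [[1.0, 0.0], [0.0, 1.0]]
aligned = semantic_matching_loss([[1.0, 0.0], [0.0, 1.0]], original)
swapped = semantic_matching_loss([[0.0, 1.0], [1.0, 0.0]], original)
```

In this toy setup the aligned pairing yields a much smaller loss than the swapped pairing, which is the behavior an alignment objective of this kind is meant to reward.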

Paper Structure (20 sections, 10 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Scenarios with missing modalities. Taking the missing text modality as an example, black indicates that the visual modality is the victim modality. (a): Modalities are complete. (b): Modalities are complete in the training set and the victim modality is completely missing in the test set. (c): Modalities are missing in both training and test sets. The victim modality is missing randomly in the training set and completely missing in the test set. (d): Modalities are missing randomly in both training and test sets with the same probability.
  • Figure 2: Schematic illustration of the proposed TRML framework. Our model comprises three components: 1) a Multimodal Foundational Model that learns representations for latent semantic alignment; 2) a Missing Modality Inference module that, taking a missing text modality as an example, uses the visual modality as a prompt to generate a virtual text modality; 3) a Semantic Match Learning module that aligns the semantic space of the virtual text modality with that of the original text modality, so that the virtual modality learns semantics similar to those of the missing text modality.
  • Figure 3: The semantic similarity matrix between the generated virtual modality and the virtual modality for 8 randomly selected samples on the CMU-MOSI dataset.
  • Figure 4: Performance of different multimodal foundational models in scenarios with missing modality on CMU-MOSI dataset.
  • Figure 5: Performance comparison of different models using the same multimodal foundational model on CMU-MOSI and CMU-MOSEI. The significance test (t-test) gives $p < 0.05$ in all settings.
  • ...and 4 more figures
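
The four missing-modality settings in the Figure 1 caption can be simulated with per-sample masks, as in the following minimal sketch. The function names, the `missing_rate` default, and the seeding scheme are assumptions made for illustration; the caption only fixes which splits are complete, randomly missing, or completely missing.

```python
import random

def victim_missing_mask(num_samples, missing_rate, rng):
    """Return one boolean per sample; True means the victim modality is missing."""
    return [rng.random() < missing_rate for _ in range(num_samples)]

def build_scenario(setting, n_train, n_test, missing_rate=0.5, seed=0):
    """Build (train_mask, test_mask) for settings 'a' to 'd' of Figure 1."""
    rng = random.Random(seed)
    if setting == "a":    # complete in both splits
        train_rate, test_rate = 0.0, 0.0
    elif setting == "b":  # complete in training, completely missing at test time
        train_rate, test_rate = 0.0, 1.0
    elif setting == "c":  # random in training, completely missing at test time
        train_rate, test_rate = missing_rate, 1.0
    elif setting == "d":  # random in both splits with the same probability
        train_rate, test_rate = missing_rate, missing_rate
    else:
        raise ValueError(f"unknown setting: {setting!r}")
    train_mask = victim_missing_mask(n_train, train_rate, rng)
    test_mask = victim_missing_mask(n_test, test_rate, rng)
    return train_mask, test_mask

train_b, test_b = build_scenario("b", n_train=6, n_test=4)
train_d, test_d = build_scenario("d", n_train=6, n_test=4, missing_rate=0.3)
```

A mask entry of `True` marks a sample whose victim modality would be replaced by a generated virtual modality before it reaches the model.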