Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models

Donggeun Kim, Taesup Kim

TL;DR

This work tackles missing modalities in unpaired multimodal learning by leveraging independently pretrained unimodal encoders and parameter-efficient fine-tuning. It introduces a feature predictor guided by read-only prompts and a VICReg-based objective to predict missing modality embeddings, enabling effective late fusion without full multimodal pretraining. The approach achieves strong, robust performance across MM-IMDb, UPMC Food-101, and Hateful Memes under both complete and missing-training settings, with BitFit as a highly parameter-efficient fine-tuning option. The method demonstrates practical applicability in low-resource, multilingual, or domain-specific contexts where large paired multimodal data or joint encoders are unavailable.
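The VICReg-based objective mentioned above can be illustrated with a minimal pure-Python sketch. It assumes the predicted embeddings and the ground-truth embeddings of the missing modality act as the two branches of the standard VICReg loss (invariance, variance, and covariance terms). The function names, the default weights (25, 25, 1), and the constants `gamma` and `eps` follow the original VICReg formulation and are assumptions here, not values confirmed by this summary.

```python
import math

def _columns(z):
    """Transpose a batch (list of n vectors of size d) into d columns."""
    return [list(col) for col in zip(*z)]

def invariance(pred, target):
    """Mean squared Euclidean distance between paired embeddings."""
    n = len(pred)
    return sum(
        sum((p - t) ** 2 for p, t in zip(p_row, t_row))
        for p_row, t_row in zip(pred, target)
    ) / n

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation; penalizes collapse."""
    n = len(z)
    cols = _columns(z)
    total = 0.0
    for col in cols:
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)  # unbiased variance
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / len(cols)

def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by d."""
    n = len(z)
    cols = _columns(z)
    d = len(cols)
    centered = []
    for col in cols:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
                total += cov ** 2
    return total / d

def vicreg_loss(pred, target, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the three VICReg terms (weights are assumed defaults)."""
    return (
        lam * invariance(pred, target)
        + mu * (variance_term(pred) + variance_term(target))
        + nu * (covariance_term(pred) + covariance_term(target))
    )
```

The variance and covariance terms are what distinguish this objective from a plain regression loss: a predictor that maps every input to the same vector would score well on invariance against a collapsed target but is penalized by the variance hinge.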

Abstract

Multimodal learning typically relies on the assumption that all modalities are fully available during both the training and inference phases. However, in real-world scenarios, consistently acquiring complete multimodal data presents significant challenges due to various factors. This often leads to the issue of missing modalities, where data for certain modalities are absent, posing considerable obstacles not only for the availability of multimodal pretrained models but also for their fine-tuning and the preservation of robustness in downstream tasks. To address these challenges, we propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method. This framework enables the model to predict the embedding of a missing modality in the representation space during inference. Our method effectively predicts the missing embedding through prompt tuning, leveraging information from available modalities. We evaluate our approach on several multimodal benchmark datasets and demonstrate its effectiveness and robustness across various scenarios of missing modalities.
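The inference-time behaviour described in the abstract can be sketched as follows: when one modality is absent, its embedding is predicted from the available one and then fused as usual. This is an illustration under stated assumptions, not the authors' implementation. The linear predictors, the helper names `make_linear_predictor` and `fused_features`, and the use of concatenation for fusion are all hypothetical; in the paper the prediction is produced through prompt tuning on the unimodal encoders.

```python
from typing import Callable, List, Optional

Vector = List[float]
Predictor = Callable[[Vector], Vector]

def make_linear_predictor(weights: List[Vector], bias: Vector) -> Predictor:
    """Hypothetical stand-in for a learned feature predictor: y = W x + b."""
    def predict(x: Vector) -> Vector:
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)
        ]
    return predict

def fused_features(
    image_emb: Optional[Vector],
    text_emb: Optional[Vector],
    predict_text_from_image: Predictor,
    predict_image_from_text: Predictor,
) -> Vector:
    """Fill in a missing embedding from the available one, then concatenate."""
    if image_emb is None and text_emb is None:
        raise ValueError("at least one modality must be available")
    if text_emb is None:
        text_emb = predict_text_from_image(image_emb)
    if image_emb is None:
        image_emb = predict_image_from_text(text_emb)
    # Concatenation is an assumed fusion choice for this sketch.
    return image_emb + text_emb
```

For example, with a text embedding missing, `fused_features(image_emb, None, img_to_txt, txt_to_img)` returns the image embedding followed by the predicted text embedding, so a downstream classifier always receives an input of the same shape.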

Paper Structure (31 sections, 4 equations, 10 figures, 6 tables)


Figures (10)

  • Figure 1: Overview of our framework. Using separate unimodal encoders, we apply PEFT for the target tasks and introduce trainable prompts that predict missing-modality features, with attention masking to preserve the input's representations. During inference, the model can generate the embedding of whichever modality is missing.
  • Figure 2: Performance on multimodal classification datasets under the complete training setting. All experiments were trained with 100% text and 100% image data and evaluated as a function of the text missing rate. Because the official code from Ma_2022_CVPR is unavailable, the results for the multimodal model are taken from prior work Ma_2022_CVPR. Results for unimodal-based methods are averaged over five different seeds. Dotted lines indicate multimodal models, whereas solid lines represent unimodal models.
  • Figure 3: Ablation on the effect of the prompts under the complete training setting.
  • Figure 4: t-SNE visualization of (a) image features and (b) text features. Feature prediction with prompts (green) yields embeddings that align more closely with the ground-truth features (blue) than prediction without prompts (red).
  • Figure 5: Ablation study on prompt length and attention masking.
  • ...and 5 more figures