MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection

Niki Nezakati, Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif

TL;DR

MMP trains a single model that is robust to any missing-modality scenario. It randomly masks a subset of modalities during training and learns to project the available input modalities to estimate the tokens of the masked modalities.

Abstract

Multimodal learning seeks to combine data from multiple input sources to enhance the performance of different downstream tasks. In real-world scenarios, performance can degrade substantially if some input modalities are missing. Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination. These approaches are either tied to specific modalities or become computationally expensive as the number of input modalities increases. In this paper, we propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario. We achieve this by randomly masking a subset of modalities during training and learning to project available input modalities to estimate the tokens for the masked modalities. This approach enables the model to effectively learn to leverage the information from the available modalities to compensate for the missing ones, enhancing missing modality robustness. We conduct a series of experiments with various baseline models and datasets to assess the effectiveness of this strategy. Experiments demonstrate that our approach improves robustness to different missing modality scenarios, outperforming existing methods designed for missing modalities or specific modality combinations.
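The masking step described in the abstract can be illustrated with a minimal sketch. The function name `sample_modality_mask`, the modality names, and the rule that at least one modality always stays available (so that there is something to project from) are assumptions made for illustration; the abstract only states that a random subset of modalities is masked during training.

```python
import random

def sample_modality_mask(modalities, rng, max_masked=None):
    """Pick a random subset of modalities to mask.

    Assumption (not from the paper): at least one modality is always
    left available, and the number of masked modalities is uniform.
    """
    if len(modalities) < 2:
        return set()
    limit = len(modalities) - 1
    if max_masked is not None:
        limit = min(max_masked, limit)
    k = rng.randint(0, limit)  # 0 means nothing is masked in this step
    return set(rng.sample(list(modalities), k))

rng = random.Random(0)
modalities = ["rgb", "depth", "lidar"]  # hypothetical modality names
for step in range(3):
    masked = sample_modality_mask(modalities, rng)
    available = [m for m in modalities if m not in masked]
    # In training, tokens of `masked` would be replaced by tokens
    # projected from `available`.
    print(step, sorted(masked), available)
```

Sampling a fresh mask at every step exposes one model to many missing-modality combinations, which matches the abstract's stated goal of avoiding a separate training run per combination.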

Paper Structure (29 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Architecture of the proposed MMP approach for training a single multimodal model that is robust to missing modalities. Input modalities are passed through embedding layers, generating tokens. For a masked modality $i$, a projection function utilizes the tokens from the available modalities to generate projected tokens. These projected tokens are then passed to the masked modality branch.
  • Figure 2: Visualization of the modality projection approach. Available modality tokens are processed through cross-attention to update their aggregated tokens. These aggregated tokens are combined with those of the masked modality through another cross-attention step. The resulting cross-modal relationships are used to attend to the actual tokens of the available modalities. The final output tokens are passed through an MLP to generate the projected tokens of the masked modality.
  • Figure 3: Visualization of predicted segmentation maps for the Pretrained (CMNeXt) model and our MMP approach. The title above each image gives the method name, with the available modalities in parentheses. Blue boxes mark the areas where the differences are most prominent. A and D denote the angle and degree of linear polarization, respectively.
  • Figure S1: Average cosine similarity between model predictions with real and projected tokens on the UPMC Food-101 dataset when the image modality is missing. We substitute the missing modality tokens with the projected tokens.
  • Figure S2: Average cosine similarity between model predictions with real and projected tokens on the UPMC Food-101 dataset when the text modality is missing. We substitute the missing modality tokens with the projected tokens.
  • ...and 1 more figure
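The three attention steps named in the Figure 2 caption can be sketched in plain Python. This is only an illustration under strong assumptions: attention is single-head scaled dot-product with no learned weights, the aggregated tokens are passed in as given inputs, and the final MLP is omitted. The function names `cross_attention` and `project_masked_modality` are hypothetical and do not come from the paper.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned weights."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def project_masked_modality(available_tokens, available_agg, masked_agg):
    # Step 1: aggregated tokens of the available modality attend to its tokens.
    updated_agg = cross_attention(available_agg, available_tokens, available_tokens)
    # Step 2: aggregated tokens of the masked modality attend to the updated ones.
    cross_modal = cross_attention(masked_agg, updated_agg, updated_agg)
    # Step 3: the cross-modal result attends to the actual available tokens.
    attended = cross_attention(cross_modal, available_tokens, available_tokens)
    # The caption's final MLP is omitted here (treated as identity).
    return attended

available_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy token values
available_agg = [[0.2, 0.1]]
masked_agg = [[0.3, -0.4], [0.0, 0.9]]
projected = project_masked_modality(available_tokens, available_agg, masked_agg)
print(len(projected), len(projected[0]))
```

In this simplified form, each projected token is a weighted average of the available modality's tokens, which makes the "estimate the masked tokens from the available ones" idea concrete without claiming to reproduce the paper's architecture.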