Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation

Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif

TL;DR

This work proposes a simple, parameter-efficient adaptation procedure for pretrained multimodal networks. The procedure partially bridges the performance drop caused by missing modalities and outperforms existing methods for robust multimodal learning with missing modalities.

Abstract

Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks deteriorates significantly if one or more modalities are absent at test time. To enable robustness to missing modalities, we propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks. In particular, we exploit modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities and, in some cases, outperform independent, dedicated networks trained for the available modality combinations. The proposed adaptation requires an extremely small number of parameters (e.g., fewer than 1% of the total parameters) and is applicable to a wide range of modality combinations and tasks. We conduct a series of experiments to highlight the missing-modality robustness of our proposed method on five different multimodal tasks across seven datasets. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
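
The feature modulation mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes the scale-and-shift form named in the Figure 1 caption (element-wise multiplication and addition), applied per channel to the output of a frozen layer. The names `scale_shift` and `added_parameter_fraction`, the toy values, and the choice of a single dense layer with 256 channels are hypothetical and serve only to show why such an adapter adds few parameters.

```python
from typing import List

Features = List[List[float]]  # rows are tokens/pixels, columns are channels


def scale_shift(features: Features, gamma: List[float], beta: List[float]) -> Features:
    """Per-channel modulation: out[i][c] = gamma[c] * features[i][c] + beta[c]."""
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    for row in features:
        if len(row) != len(gamma):
            raise ValueError("each feature row must have one value per channel")
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in features]


def added_parameter_fraction(num_channels: int) -> float:
    """Share of parameters added by one scale-and-shift pair.

    Assumption for illustration only: the frozen layer is a dense
    num_channels x num_channels map with a bias vector.
    """
    frozen = num_channels * num_channels + num_channels
    added = 2 * num_channels  # one gamma and one beta per channel
    return added / (frozen + added)


# Toy features from a frozen layer: 2 tokens, 3 channels (made-up values).
h = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Hypothetical learned parameters for one set of available modalities.
gamma = [1.0, 0.5, 2.0]
beta = [0.0, 1.0, -1.0]

adapted = scale_shift(h, gamma, beta)

# Identity initialisation (gamma = 1, beta = 0) leaves the features unchanged,
# so an adapter of this kind can start from the pretrained model's behaviour.
unchanged = scale_shift(h, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])

fraction = added_parameter_fraction(256)
```

In this toy setting the adapter contributes well under 1% of the layer's parameters. That figure comes from the arithmetic of the sketch, not from the paper's reported parameter counts.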

Paper Structure

This paper contains 45 sections, 7 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: a) Overview of our model adaptation approach for robust MML. A model pretrained on all the modalities is adapted using a small number of learnable parameters to handle different modality combinations. We insert adaptable layers after each layer of the encoders and the fusion block to learn the modulation as a function of the available input modalities to compensate for the missing modalities. The grayed-out branch (missing modality) is inactive and does not contribute to the output. b) Low-rank model adaptation computes features using frozen weights and low-rank weight updates and combines them. c) Scale and shift feature adaptation transforms the input by element-wise multiplication and addition.
  • Figure 2: Examples of predicted segmentation maps for the Pretrained and Adapted models. The title above each subimage shows the method name (available modalities). The CMNeXt column shows the predictions with all the modalities. Segmentation quality improves significantly after model adaptation for all input modality combinations. Green boxes highlight areas with salient differences in results (e.g., cars and humans missing in the Pretrained model with missing modalities but visible in the Adapted model). For the MCubeS dataset, we only show RGB input images for brevity. A, D and N denote angle of linear polarization, degree of linear polarization, and near-infrared, respectively.
  • Figure 3: Cosine similarity between complete and missing modality features of the pretrained model (Pretrained) and complete and missing modality features of the adapted model (Adapted) on the MCubeS and NTU RGB+D datasets. Adapted models show higher similarity to the complete modality features compared to the pretrained model, indicating less deviation and better handling of missing modalities.
  • Figure S1: Cosine similarity between complete and missing modality features of the pretrained model (Pretrained) and complete and missing modality features of the adapted model (Adapted) under different missing modality scenarios on the MCubeS dataset. The adapted model shows higher similarity to the complete modality features compared to the pretrained model, indicating less deviation and better handling of missing modalities.
  • Figure S2: Cosine similarity between complete and missing modality features of the pretrained model (Pretrained) and complete and missing modality features of the adapted model (Adapted) under different missing modality scenarios on the NTU RGB+D dataset. Adapted models show comparable or higher similarity to the complete modality features compared to the pretrained model, indicating less deviation and better handling of missing modalities.
  • ...and 1 more figure
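
Two ideas from the captions above lend themselves to a short sketch: the low-rank adaptation in Figure 1b and the cosine-similarity comparison in Figures 3, S1, and S2. The code below is a minimal illustration, not the authors' code. It assumes the common low-rank form y = W x + B (A x), with W frozen and only A and B learnable, which matches the caption's description of combining frozen weights with a low-rank update. All function names and numeric values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def low_rank_forward(w_frozen: Matrix, a: Matrix, b: Matrix, x: Vector) -> Vector:
    """y = W x + B (A x), where A is (r x d_in) and B is (d_out x r).

    Assumed form of the low-rank update; W stays frozen and only A and B
    would be trained.
    """
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, update)]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two feature vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Toy frozen weight (2 x 2 identity) and a rank-1 update (made-up values).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]        # r x d_in = 1 x 2
b = [[0.5], [0.0]]      # d_out x r = 2 x 1
x = [1.0, 2.0]

y_adapted = low_rank_forward(w, a, b, x)                # [2.5, 2.0]
y_frozen = low_rank_forward(w, a, [[0.0], [0.0]], x)    # zero B: frozen output

# Stand-in feature vectors for the comparison used in the figures:
# features with all modalities versus features with one modality missing.
complete = [1.0, 0.0]
similarity_same = cosine_similarity(complete, [2.0, 0.0])
similarity_orthogonal = cosine_similarity(complete, [0.0, 3.0])
```

Setting B to zero recovers the frozen layer's output, which is why low-rank adapters are often initialised that way. In the figures, a similarity closer to 1 means the missing-modality features deviate less from the complete-modality features.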