Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity

Zhuo Zhi, Ziquan Liu, Moe Elbadawi, Adam Daneshmend, Mine Orlu, Abdul Basit, Andreas Demosthenous, Miguel Rodrigues

TL;DR

This work tackles multimodal learning with missing modalities under data scarcity by introducing a semi-parametric, retrieval-augmented in-context learning (ICL) framework that sits on top of a frozen pretrained multimodal transformer (e.g., ViLT). For each sample, the method retrieves $Q$ nearest full-modality neighbors using cosine similarity on the CLS token, aggregates their context, and feeds it into an ICL module that can be configured as cross-attention (ICL-CA) or next-token prediction (ICL-NTP). Only the ICL module is trained, while the backbone remains fixed, enabling high data efficiency in low-data regimes. Across four benchmarks spanning medical and vision-language tasks, the proposed ICL-CA method yields an average improvement of $6.1\%$ over the strong MAP baseline in low-data settings and substantially narrows the performance gap between missing- and full-modality data; it also demonstrates favorable inference latency and adaptable neighbor-context utilization. The results highlight the value of data-centric context augmentation for robust multimodal learning when data are scarce and some modalities are unavailable, with future work targeting retrieval efficiency and extension to more modalities.
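The retrieval step described above can be illustrated with a minimal NumPy sketch. It assumes the CLS embeddings of the full-modality training samples have already been extracted by the frozen backbone and stacked into a matrix; the function name `retrieve_neighbors`, the toy 2-D embeddings, and the choice $Q = 2$ are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
import numpy as np

def retrieve_neighbors(query_cls, bank_cls, q):
    """Return indices and cosine similarities of the q bank rows closest to the query.

    query_cls: (d,) CLS embedding of the current sample (hypothetical input).
    bank_cls:  (n, d) CLS embeddings of full-modality training samples.
    """
    # After L2 normalization, a dot product equals the cosine similarity.
    query = query_cls / (np.linalg.norm(query_cls) + 1e-12)
    bank = bank_cls / (np.linalg.norm(bank_cls, axis=1, keepdims=True) + 1e-12)
    sims = bank @ query
    # Sort by descending similarity; a stable sort keeps ties deterministic.
    top = np.argsort(-sims, kind="stable")[:q]
    return top, sims[top]

# Toy memory bank with four 2-D "CLS" vectors (illustrative values only).
bank = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
idx, sims = retrieve_neighbors(query, bank, q=2)
```

For realistic bank sizes, an exhaustive scan like this one is the simplest option; the TL;DR itself lists retrieval efficiency as future work, so any approximate-search replacement would be a design choice outside what the source specifies.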

Abstract

Multimodal machine learning with missing modalities is an increasingly relevant challenge arising in various applications such as healthcare. This paper extends the current research into missing modalities to the low-data regime, i.e., a downstream task has both missing modalities and limited sample size issues. This problem setting is particularly challenging and also practical as it is often expensive to get full-modality data and sufficient annotated training samples. We propose to use retrieval-augmented in-context learning to address these two crucial issues by unleashing the potential of a transformer's in-context learning ability. Diverging from existing methods, which primarily belong to the parametric paradigm and often require sufficient training samples, our work exploits the value of the available full-modality data, offering a novel perspective on resolving the challenge. The proposed data-dependent framework exhibits a higher degree of sample efficiency and is empirically demonstrated to enhance the classification model's performance on both full- and missing-modality data in the low-data regime across various multimodal learning tasks. When only 1% of the training data are available, our proposed method demonstrates an average improvement of 6.1% over a recent strong baseline across various datasets and missing states. Notably, our method also reduces the performance gap between full-modality and missing-modality data compared with the baseline.

Paper Structure (14 sections, 3 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of the proposed method. (a) Assuming each sample contains two modalities, $x_i^{m_1}$ and $x_i^{m_2}$, we obtain the sample's feature $H_i = (H_i^{m_1}, H_i^{m_2}, cls_i)$ from a pre-trained multimodal transformer; note that $x_i^{m_1}$ or $x_i^{m_2}$ may be missing. (b) We use the $cls$ token to compute the cosine similarity between the current sample and all full-modality training samples, and then retrieve the $Q$ most similar samples. (c) We feed the pooled feature of the current sample, $\tilde{H}_i$, and the pooled features of its neighbors, $\tilde{H}^{NN}_i$, into the ICL module to predict the label $\hat{y}_i$. Only the ICL module is trained; all other components are frozen. The retrieval-augmented operation is identical during training and inference. The terms 'missing modality' and 'incomplete modality', and likewise 'full modality' and 'complete modality', are used interchangeably.
  • Figure 2: (a) Learning curves of ICL-CA (ours) and Missing-Aware Prompt (MAP) (Lee et al., 2023) on the Food-101 dataset in the low-data regime. (b) Learning curves of the two methods on the MedFuse-I dataset. The subsampling ratio is set to 0.01. The difference in the number of learning steps is due to early stopping. During each training run, we compute the metric for full- and missing-modality samples separately and label them '-full' and '-miss'. (c) Performance of MAP and our ICL-CA on four multimodal datasets with missing modalities. The y-axis shows the dataset name and missing status. The x-axis shows the metric for each dataset: AUROC for MedFuse-I, MedFuse-P, and HatefulMemes, and accuracy for Food-101. On each dataset, we compute the metric for full- and missing-modality test data separately and show the results in dark and light colors, respectively. Legend entries denote (method, full/missing modality). When the task complexity is low, e.g., binary classification tasks such as HatefulMemes (Kiela et al., 2020) and MedFuse-I (Hayat et al., 2022), full-modality performance lags behind missing-modality performance, because fitting the training data does not require full-modality information. When the task complexity is high, e.g., a multi-class task such as Food-101 (101 classes) (Bossard et al., 2014), full-modality performance surpasses missing-modality performance, because the task requires all modalities to model the training data adequately. Our ICL-CA is significantly better than MAP on the four datasets in four cases, as the table below shows. See the Experiment section for more details.
  • Figure 3: The illustration of two ICL approaches. (a) ICL by cross attention. (b) ICL by next-token prediction. The yellow and green tokens denote features of two different modalities and the blue token is the $cls$ token.
  • Figure 4: The performance of MAP and ICL-CA on MedFuse-I and MedFuse-P when using different training set sizes. Our proposed ICL method is highly competitive in low-data settings ($r_{sub}$ from 0.01 to 0.1). Crucially, our approach improves performance on both full- and missing-modality data, outperforming the MAP baseline.
  • Figure 5: The performance of MAP and ICL-CA on HatefulMemes and Food-101 when using different training set sizes. The performance of our ICL-CA is much better than that of MAP in the low-data regime ($r_{sub}$ from 0.01 to 0.1).
  • ...and 1 more figures
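
The cross-attention variant outlined in the captions of Figure 1 (c) and Figure 3 (a) can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions rather than the paper's architecture: it uses a single attention head, randomly initialized projection matrices (`w_q`, `w_k`, `w_v`, `w_out`), a residual connection, mean pooling, and a two-class linear head, none of which are specified in the text above. The token counts and feature width are likewise hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: the current sample attends to neighbor tokens."""
    q = query_tokens @ w_q        # (T, d) queries from the current sample
    k = context_tokens @ w_k      # (N, d) keys from the retrieved neighbors
    v = context_tokens @ w_v      # (N, d) values from the retrieved neighbors
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1 over neighbor tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8                                           # hypothetical feature width
sample_tokens = rng.normal(size=(4, d))         # stand-in for the pooled feature of the sample
neighbor_tokens = rng.normal(size=(3 * 4, d))   # stand-in for 3 neighbors with 4 tokens each

# Randomly initialized weights; in the described setup only this module would be trained.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
w_out = rng.normal(size=(d, 2))                 # hypothetical 2-class linear head

attended, weights = cross_attention(sample_tokens, neighbor_tokens, w_q, w_k, w_v)
fused = sample_tokens + attended                # residual connection (an assumption)
probs = softmax(fused.mean(axis=0) @ w_out)     # class probabilities for the sample
```

Because the backbone features are treated as fixed inputs here, only the small set of projection and head weights would receive gradients, which is consistent with the caption's statement that only the ICL module is trained.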