Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
Zhuo Zhi, Ziquan Liu, Moe Elbadawi, Adam Daneshmend, Mine Orlu, Abdul Basit, Andreas Demosthenous, Miguel Rodrigues
TL;DR
This work tackles multimodal learning with missing modalities under data scarcity by introducing a semi-parametric, retrieval-augmented in-context learning (ICL) framework that sits on top of a frozen pretrained multimodal transformer (e.g., ViLT). For each sample, the method retrieves the $Q$ nearest full-modality neighbors by cosine similarity between CLS-token features, aggregates their context, and feeds it into an ICL module that can be configured as cross-attention (ICL-CA) or next-token prediction (ICL-NTP). Only the ICL module is trained while the backbone remains frozen, which makes the approach data-efficient in low-data regimes. Across four benchmarks spanning medical and vision-language tasks, ICL-CA yields an average improvement of $6.1\%$ over the strong MAP baseline when only $1\%$ of the training data are available and substantially narrows the performance gap between missing- and full-modality data; it also shows favorable inference latency and adaptable use of neighbor context. The results highlight the value of data-centric context augmentation for robust multimodal learning when data are scarce and some modalities are unavailable; future work targets retrieval efficiency and extension to more modalities.
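The retrieval step described above can be illustrated with a minimal sketch. It assumes the CLS-token features have already been extracted by the frozen backbone and are available as plain lists of floats; the function names `cosine_similarity` and `retrieve_neighbors`, the toy feature values, and the brute-force search are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_neighbors(query_cls, bank_cls, q):
    """Return (index, similarity) pairs for the q most similar bank entries.

    query_cls: CLS feature of the current sample (hypothetical input).
    bank_cls:  CLS features of the full-modality samples (hypothetical input).
    """
    scored = [
        (cosine_similarity(query_cls, feat), i)
        for i, feat in enumerate(bank_cls)
    ]
    # Highest similarity first; a brute-force sort is enough for a sketch.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, sim) for sim, i in scored[:q]]


# Toy 4-dimensional "CLS" features, made up for illustration only.
bank = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
query = [1.0, 0.05, 0.0, 0.0]
neighbors = retrieve_neighbors(query, bank, q=2)
neighbor_ids = [i for i, _ in neighbors]
```

In this toy example the two bank entries that point in nearly the same direction as the query are selected. With a large pool of full-modality samples, an approximate nearest-neighbor index would likely replace the exhaustive sort, which relates to the retrieval-efficiency direction mentioned as future work.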
Abstract
Multimodal machine learning with missing modalities is an increasingly relevant challenge arising in various applications such as healthcare. This paper extends current research on missing modalities to the low-data regime, i.e., to downstream tasks that suffer from both missing modalities and a limited sample size. This setting is both challenging and practically relevant, since full-modality data and sufficient annotated training samples are often expensive to obtain. We propose to use retrieval-augmented in-context learning to address these two crucial issues by unleashing the potential of a transformer's in-context learning ability. Unlike existing methods, which primarily belong to the parametric paradigm and often require sufficient training samples, our work exploits the value of the available full-modality data, offering a novel perspective on resolving the challenge. The proposed data-dependent framework exhibits a higher degree of sample efficiency and is empirically demonstrated to enhance the classification model's performance on both full- and missing-modality data in the low-data regime across various multimodal learning tasks. When only 1% of the training data are available, our proposed method demonstrates an average improvement of 6.1% over a recent strong baseline across various datasets and missing states. Notably, our method also reduces the performance gap between full-modality and missing-modality data compared with the baseline.
