What Makes Multimodal In-Context Learning Work?
Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski
TL;DR
The paper investigates why Multimodal In-Context Learning (M-ICL) works in large multimodal models by proposing a formal framework and applying it to open-source models such as IDEFICS and OpenFlamingo across captioning, classification, and VQA. It systematically ablates modalities and evaluates retrieval-based context selection (RICES), finding that text content largely drives performance when both modalities are present, while images mainly affect image-to-text tasks. Retrieval-based strategies provide gains but largely operate as a soft copy of target-like demonstrations, revealing recency and majority-vote biases that limit true learning from demonstrations. The work highlights practical implications for deploying M-ICL, suggests improvements via better retrieval and bias mitigation, and calls for further study of stronger models and more diverse prompts to realize genuine multimodal in-context learning benefits.
Abstract
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, for example through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code is available at https://gitlab.com/folbaeni/multimodal-icl
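To make the comparison in finding (2) concrete, the following is a minimal sketch of similarity-based demonstration retrieval followed by a majority vote over the retrieved labels. It is not the authors' implementation: the function names, the use of cosine similarity over precomputed embedding vectors, and the choice to place the most similar demonstration last are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_demonstrations(query_emb, pool, k):
    """Hypothetical RICES-style selection.

    pool is a list of (embedding, label) pairs. Returns the k pool items most
    similar to the query, ordered from least to most similar so that the
    closest demonstration would sit right before the query in a prompt
    (an assumption of this sketch).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return ranked[:k][::-1]


def majority_vote(demos):
    """Baseline that ignores the model: return the most frequent label among the demonstrations."""
    labels = [label for _, label in demos]
    return Counter(labels).most_common(1)[0][0]


# Toy pool with made-up 2-d embeddings and labels.
pool = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.8, 0.2], "dog"),
    ([0.1, 0.9], "dog"),
]
demos = retrieve_demonstrations([1.0, 0.0], pool, k=3)
print([label for _, label in demos])  # ['dog', 'cat', 'cat']
print(majority_vote(demos))           # cat
```

If a model prompted with the retrieved demonstrations does no better than `majority_vote` on the same demonstrations, the gain from retrieval can be attributed to the retrieved labels rather than to learning from the context, which is the kind of comparison the abstract describes.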
