What Makes Multimodal In-Context Learning Work?

Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski

TL;DR

The paper tackles why Multimodal In-Context Learning (M-ICL) works in large multimodal models by proposing a formal framework and applying it to open-source models like IDEFICS and OpenFlamingo across captioning, classification, and VQA. It systematically ablates modalities and evaluates retrieval-based context selection (RICES), finding that text content largely drives performance when both modalities are present, while images mainly impact image-to-text tasks. Retrieval-based strategies provide gains but largely operate as a soft copy of target-like demonstrations, revealing recency and majority-vote biases that limit true learning from demonstrations. The work highlights practical implications for deploying M-ICL, suggesting improvements via better retrieval and bias mitigation, and calls for further study on stronger models and more diverse prompts to realize genuine multimodal in-context learning benefits.

Abstract

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl
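
To make the second finding concrete, here is a minimal sketch of similarity-based demonstration retrieval in the spirit of RICES, together with the majority-vote baseline it is compared against. It is not the authors' implementation: the function names (`retrieve_demonstrations`, `majority_vote`), the toy 2-D embeddings, and the choice to place the most similar demonstration last (closest to the query) are assumptions made for illustration. In practice the embeddings would come from an image encoder.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_demonstrations(query_emb, pool, k):
    """Pick the k pool items most similar to the query.

    `pool` is a list of (embedding, label) pairs (hypothetical format).
    The result is ordered from least to most similar, so the closest
    demonstration ends up right before the query in the prompt
    (an assumed ordering, chosen for this sketch).
    """
    top_k = sorted(pool, key=lambda item: cosine(query_emb, item[0]), reverse=True)[:k]
    return top_k[::-1]


def majority_vote(demos):
    """Baseline that ignores the query: return the most frequent demonstration label."""
    return Counter(label for _, label in demos).most_common(1)[0][0]


# Toy usage with made-up embeddings and labels.
pool = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.8, 0.2], "dog"),
]
demos = retrieve_demonstrations([1.0, 0.0], pool, k=3)
print([label for _, label in demos])
print(majority_vote(demos))
```

If a model's predictions under retrieval mostly coincide with `majority_vote(demos)`, or with the label of the last demonstration, that is consistent with the copying and recency behavior the abstract describes rather than with learning from the demonstrations.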

Paper Structure (25 sections, 1 equation, 13 figures, 12 tables)

This paper contains 25 sections, 1 equation, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Empirical analysis of M-ICL behavior. 1. Images play a crucial role in image-to-text tasks. 2. M-ICL is mostly driven by text when the task includes both image and text as input. 3. For advanced M-ICL strategies ranking ICL examples by their similarity to the query, the LMM mostly does a majority vote over the demonstration pairs. 4. M-ICL copies the output of the last demonstration pair.
  • Figure 2: Influence of each modality on M-ICL performance. We show (a) the 16-shot performance of M-ICL with different contexts: baseline context (green), demonstrations without images (orange), or with random images (blue). For VQA (c), we also consider the case where the questions $T$ of the demonstrations are removed (pink) or replaced by a random question (green). In (b), we show the evolution of performance as the number of shots varies.
  • Figure 3: M-ICL tends to output the most frequent words of the context. We show the frequency of the most common words (excluding stop words) and 3-grams in the COCO dataset, which is used to construct the context demonstrations. We compare the word frequencies of the model outputs, with normal (blue) and random images (orange and green), to the dataset word frequencies (pink).
  • Figure 4: RICES improves M-ICL performance on most datasets. Score differences between RICES and random sampling, with a varying number of demonstrations and across various datasets, with their respective metrics.
  • Figure 5: Influence of each modality on RICES M-ICL performance. We show the 16-shot performance of RICES M-ICL with different contexts: baseline prompt (green), demonstrations without images (orange), random images paired with responses from demonstrations sampled using RICES (blue), and random responses paired with images from demonstrations sampled using RICES (purple).
  • ...and 8 more figures