From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen

TL;DR

This work probes why multimodal in-context learning works by systematically varying demonstration modalities and selection strategies across model scales and a broad task suite. It reveals that visual and textual information contribute differently depending on the task, and that modality-aware demonstration selection can substantially boost performance. The authors show that models can pick up task inductive biases from demonstrations, sometimes overriding pretraining priors, with dual-modality strategies offering robust gains. The findings offer practical guidelines for constructing demonstrations that improve multimodal ICL without additional fine-tuning, and provide insight into how model scale shapes bias alignment and robustness to perturbations.

Abstract

Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has investigated the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even if they are rarely seen in, or contradict, semantic priors from pretraining data. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks.
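
The modality-driven demonstration strategies mentioned above can be illustrated with a minimal sketch. It assumes each candidate demonstration and the test query have already been embedded into a vector, either from the text side or from the image side, and it ranks candidates by cosine similarity. The function names, the toy vectors, and the use of plain cosine similarity are all assumptions made for illustration; the paper's strategies rely on encoders such as CLIP and BERT, and BERTScore in particular uses token-level matching that this sketch does not reproduce.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_demonstrations(
    query_vec: Sequence[float],
    pool_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k pool items most similar to the query.

    Hypothetical helper: the vectors stand in for text or image embeddings
    produced by some encoder that is not part of this sketch.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(pool_vecs)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-d "embeddings" (made up for illustration, not real encoder outputs).
pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [0.9, 0.1]
selected = select_demonstrations(query, pool, k=2)
print(selected)
```

Swapping the source of the vectors (text encoder versus image encoder) is what would turn this into a text-driven or a visual-driven strategy under these assumptions; the ranking step itself stays the same.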

Paper Structure (29 sections, 19 figures, 3 tables)

Figures (19)

  • Figure 1: (b) Modality matters differently for ICL across tasks: visual information matters little on Application but a lot on KIE, and textual answers are more important to ICL on KIE than on Application. (c) Demonstrations selected by the text-driven strategy BERTScore are more beneficial on Application, while those selected by visual similarity (CLIP) yield higher accuracy on KIE.
  • Figure 2: Benchmarks with inductive biases that contradict semantic priors (left) or are rarely seen in pretraining data (middle and right). We list the ground truth (GT) and zero-shot responses from GPT-4o, and provide ICL analysis in \ref{sec:inductive}.
  • Figure 3: Multimodal ICL performance of IDEFICS1-80B responds differently to perturbations of visual (top) and textual (bottom) information across tasks of different difficulty levels. For the easy tasks (i.e., BenchLMM Sensor and Application) and two moderate tasks (i.e., Path-VQA and Slake-VQA), performance after various visual perturbations stays very close to that with the original, correct demonstrations, but drops noticeably when either the textual question or the answer is perturbed. For the moderate task PAD-UFES-20, neither modality matters much. For the hard task KIE, performance degrades when the image is removed or replaced. Similar observations for the other five models can be found in \ref{fig:ablation_selection_openflamingo2_4B} to \ref{fig:ablation_selection_Emu1}.
  • Figure 4: Influence of modality-driven demonstration selection strategies on the ICL performance of IDEFICS1-80B. Text-driven demonstration selection strategies (e.g., textual CLIP, BERT, and BERTScore) always improve performance over zero-shot inference and the random strategy. Strategies that consider the visual modality (e.g., visual CLIP, MMICES, and ALBEF) significantly enhance performance on KIE, where the visual modality proves critical for ICL performance, as illustrated in \ref{fig:visual_idefics-80b}. Similar observations for the other five models are visualized in \ref{fig:ablation_selection_openflamingo2_4B} to \ref{fig:ablation_selection_Emu1}.
  • Figure 5: When presented with flipped in-context exemplar annotations on AMBER Attribution, the ability to capture inductive biases that contradict semantic priors emerges when demonstrations are selected according to the textual modality (i.e., textual CLIP, BERT, and BERTScore). Ground-truth annotations for test examples are not flipped, so if a model learns to follow the flipped labels in the demonstrations, its accuracy should fall below $50\%$. Given random demonstrations or those selected using the visual modality, models cannot flip their predictions to follow the flipped annotations, whereas they can do so when given demonstrations selected by text-driven strategies (performance decreases to well below $50\%$). We show similar observations on Existence and Relation in \ref{fig:AMBER_flip}.
  • ...and 14 more figures
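
The flipped-label reasoning in the Figure 5 caption can be made concrete with a small sketch. It assumes a binary yes/no answer format and flips only the demonstration labels, while scoring predictions against the unflipped ground truth, so a model that follows the flipped demonstrations ends up below 50% accuracy. The data, field names, and helper functions are hypothetical and are not taken from the paper's code or from the AMBER benchmark.

```python
from typing import Dict, List

# Assumed binary label space for illustration.
FLIP: Dict[str, str] = {"yes": "no", "no": "yes"}


def flip_demonstrations(demos: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the demonstrations with every answer flipped."""
    return [{**demo, "answer": FLIP[demo["answer"]]} for demo in demos]


def accuracy(predictions: List[str], ground_truth: List[str]) -> float:
    """Fraction of predictions that match the (unflipped) ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Made-up demonstrations; only their answers are flipped.
demos = [
    {"question": "Is the car red?", "answer": "yes"},
    {"question": "Is the sky green?", "answer": "no"},
]
flipped = flip_demonstrations(demos)

# Test-time ground truth stays unflipped.
ground_truth = ["yes", "no", "yes", "no"]
# Hypothetical predictions from a model that mostly follows the flipped labels.
follows_flip = ["no", "yes", "no", "no"]

acc = accuracy(follows_flip, ground_truth)
print(f"accuracy against unflipped ground truth: {acc:.2f}")
```

Under the binary assumption, one minus this accuracy gives the share of test examples on which the model's answer agrees with the flipped convention, which is why accuracy well below 50% is read as the model adopting the demonstrated bias.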