From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
TL;DR
This work probes why multimodal in-context learning works by systematically varying demonstration modalities and selection strategies across model scales and a broad task suite. It reveals that visual and textual information contribute differently depending on the task, and that modality-aware demonstration selection can substantially boost performance. The authors show that models can encode task inductive biases from demonstrations, sometimes overriding pretraining priors, with dual-modality strategies offering robust gains. The findings offer practical guidelines for constructing demonstrations to improve multimodal ICL without additional fine-tuning, and they provide insight into how model scale shapes bias alignment and robustness to perturbations.
Abstract
Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has investigated the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even if those biases are rarely seen in, or contradict semantic priors from, pretraining data. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks.
