Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?

Shuo Chen; Zhen Han; Bailan He; Jianzhe Liu; Mark Buckley; Yao Qin; Philip Torr; Volker Tresp; Jindong Gu

TL;DR

This paper investigates whether multimodal large language models (MLLMs) can truly perform multimodal in-context learning (M-ICL) and analyzes two key factors: the content of multimodal demonstrations (demos) and the strategy for selecting them. It finds that the textual information in demos dominates ICL performance, while the visual content of the demos has minimal direct effect, although it still helps in selecting informative demos. To improve demo quality, the authors propose Mixed Modality In-Context Example Selection (MMICES), which first filters candidates by image similarity and then reranks them by text similarity, outperforming random selection and single-modality retrieval (RICES) across multiple models and datasets. An architectural explanation based on masked cross-attention clarifies why demo images have limited influence: demo images affect the output mainly through the demo text that is conditioned on them, not directly through the raw visual embeddings. The work provides practical guidance for designing demonstrations and selection strategies in MLLMs and suggests MMICES as a simple, effective method to boost M-ICL across vision-language tasks and model scales.
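The two-stage selection described above (filter by image similarity, then rerank by text similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mmices_select`, the parameters `k_pool` and `n_shots`, the dictionary layout of the support set, and the use of cosine similarity over precomputed embedding vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmices_select(query_img, query_txt, support, k_pool, n_shots):
    """Hypothetical two-stage demo selection.

    support: list of dicts with precomputed 'img' and 'txt' embeddings (assumed layout).
    """
    # Stage 1: keep the k_pool candidates whose images are most similar to the query image.
    pool = sorted(support, key=lambda d: cosine(query_img, d["img"]), reverse=True)[:k_pool]
    # Stage 2: rerank the surviving candidates by text similarity and keep n_shots of them.
    ranked = sorted(pool, key=lambda d: cosine(query_txt, d["txt"]), reverse=True)
    return ranked[:n_shots]


# Toy support set with 2-dimensional embeddings (made-up values).
support = [
    {"id": "a", "img": [1.0, 0.0], "txt": [0.0, 1.0]},
    {"id": "b", "img": [0.9, 0.1], "txt": [1.0, 0.0]},
    {"id": "c", "img": [0.0, 1.0], "txt": [1.0, 0.0]},
    {"id": "d", "img": [0.8, 0.2], "txt": [0.7, 0.7]},
]
selected = mmices_select([1.0, 0.0], [1.0, 0.0], support, k_pool=3, n_shots=2)
print([d["id"] for d in selected])
```

In this toy data, candidate `c` matches the query text perfectly but is dropped in the first stage because its image is dissimilar, which is the behavior that distinguishes a mixed-modality pipeline from text-only retrieval.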

Abstract

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, research on ICL in MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influences of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments revealed that multimodal ICL is predominantly driven by the textual content whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method. Code is available at https://chenxshuo.github.io/m-icl/.

Paper Structure (13 sections, 10 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Understanding Influence of Demo Content in Multimodal ICL
Influence of Visual Information on M-ICL
Influence of Textual Information on M-ICL
Further Investigating Content Influence
Understanding Demo Selection in M-ICL
Demo Selection using Single Modality
Mixed Modality In-Context Example Selection
Experiments
Experimental Setup
Results
Conclusion

Figures (10)

Figure 1: In-context learning (2-shot) on visual question answering. Pre-trained MLLMs can perform in-context learning for a given query based on a few context demos (i.e., a list of images, questions, and answers) selected from a support set.
Figure 2: The ICL performance is almost the same when removing the visual information in the demonstration. Compared to the standard scenario, exclusion and replacement of images in the demonstration hardly impact the ICL performance (as shown in the first three bars of each sub-figure). Conversely, the removal of the query image results in substantial performance degradation (as indicated by the last bar in each sub-figure).
Figure 3: The ICL performance varies under different text demo settings. Performance is largely maintained in the "different answer for same question" setting (the light orange bars). However, it decreases significantly in the "random question" and "random words as labels" settings (the green and blue bars).
Figure 4: Model block supporting interleaved image-text inputs. Visual and language information, i.e., $I$ and $T$, are first fused using a masked cross-attention layer, where each text token is only conditioned on the last preceding image. Visual embeddings $\mathbf{I_1}$ and $\mathbf{I_2}$ from demonstration images cannot directly influence query text embedding $\mathbf{T_q}$, and $\mathbf{T_q}$ only sees $\mathbf{I_q}$ in the masked cross-attention, as shown in the last row of $\mathbf{A_c}$.
Figure 5: The left figure shows the cosine similarity between hidden states in the standard setting and removing images in the demos (blue bars). Grey bars are cosine similarity between standard setting and removing query images. The right figure shows the similarity of the corresponding attention weights in the last decoder layer. Omitting demonstration visual embeddings leads to similar hidden states, but excluding query images increases their dissimilarity, indicating the minimal influence of the demo images.
...and 5 more figures
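
The masking rule described in the Figure 4 caption, where each text token is conditioned only on the last preceding image, can be sketched as a boolean mask over (text token, image) pairs. This is a simplified illustration under assumptions, not the model's actual code: the function name `last_image_mask` and the `'img'`/`'txt'` marker encoding are hypothetical, and each image is treated as a single slot rather than a set of visual tokens.

```python
def last_image_mask(token_types):
    """Build a text-to-image cross-attention mask for an interleaved sequence.

    token_types: list of 'img' / 'txt' markers in input order (assumed encoding).
    Returns one row per text token; row[j] is True iff that text token may
    attend to image j, i.e. image j is the most recent image before it.
    """
    n_images = token_types.count("img")
    mask = []
    current = -1  # index of the last image seen so far; -1 means none yet
    for kind in token_types:
        if kind == "img":
            current += 1
        else:
            mask.append([j == current for j in range(n_images)])
    return mask


# Two demos followed by the query: I_1, T_1 (two tokens), I_2, T_2, I_q, T_q.
sequence = ["img", "txt", "txt", "img", "txt", "img", "txt"]
mask = last_image_mask(sequence)
for row in mask:
    print(row)
```

The last row corresponds to the query text and has a single `True` entry in the column of the query image, which mirrors the caption's point that the demo images cannot reach the query text directly through this layer.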
