What do vision-language models see in the context? Investigating multimodal in-context learning

Gabriel O. dos Santos, Esther Colombini, Sandra Avila

TL;DR

This paper addresses the gap in understanding multimodal in-context learning (ICL) in vision-language models (VLMs) by conducting a systematic evaluation across seven models and four architectures on three image captioning benchmarks. It investigates how prompt design, training data structure (interleaved versus image-text paired), and templates affect ICL, and it introduces an attention-based analysis to assess how models use in-context information. The findings reveal that interleaved image-text training improves ICL but does not guarantee effective multimodal integration, while instruction tuning enhances instruction-following yet can diminish reliance on demonstrations; attention patterns show a persistent bias toward textual cues. These results highlight important limitations in current VLMs and suggest directions for improving multimodal ICL through better modality bridging and hybrid training strategies. The work has practical implications for designing prompt pipelines and training regimes that enable more robust multimodal learning from in-context demonstrations.

Abstract

In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our evaluation pipeline for assessing the ICL capability of VLMs. We illustrate the demonstration retrieval and caption generation steps.
  • Figure 2: Investigated templates.
  • Figure 3: Evaluating the ICL capacity of VLMs across different scenarios. We evaluate the models using both straightforward and detailed templates. Additionally, we explore scenarios where demonstration captions are provided but the demonstration images are blacked out, as well as cases where only the captions are available as context. "Idefics2 8B (IT)" refers to the instruction-tuned checkpoint of the Idefics2 architecture.
  • Figure 4: Layer-wise attention analysis. The upper row presents the variation of mean attention weight assigned to a visual token across the models' LLM layers. The lower row shows the attention entropy across all tokens at each LLM layer, reflecting the diffuseness of attention distribution. For comparability, the charts plot min-max normalized values.
  • Figure 5: Attention maps with scores aggregated by token type (log-scale). Maps are plotted for InstructBLIP Vicuna-7B and Idefics2 models, comparing the 5-shot setting across prompts built on the straightforward template (first row), detailed template (second row), and demonstrations with blacked-out images (third row). Columns correspond to the respective models.
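
The two quantities plotted in Figure 4 (mean attention weight assigned to a visual token, and attention entropy, both per LLM layer and min-max normalized) can be illustrated with a minimal sketch. This is not the authors' code: the function names `layer_attention_stats` and `min_max`, the tensor layout `(layers, heads, queries, keys)`, and the choice to average uniformly over heads and query positions are all assumptions made for illustration, since this page does not state how the paper aggregates them.

```python
import numpy as np


def layer_attention_stats(attn, visual_mask):
    """Per-layer attention statistics (illustrative, hypothetical helper).

    attn:        array of shape (layers, heads, queries, keys); each row
                 along the last axis is assumed to be a softmax distribution.
    visual_mask: boolean array of shape (keys,), True where the key
                 position holds a visual token.
    """
    attn = np.asarray(attn, dtype=np.float64)
    visual_mask = np.asarray(visual_mask, dtype=bool)

    # Average over heads and query positions -> shape (layers, keys).
    per_key = attn.mean(axis=(1, 2))

    # Mean weight received by a single visual token at each layer.
    visual_mean = per_key[:, visual_mask].mean(axis=1)

    # Shannon entropy of each attention row, then averaged over heads
    # and queries. Higher entropy means more diffuse attention.
    eps = 1e-12  # avoids log(0) for exactly-zero weights
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    entropy = row_entropy.mean(axis=(1, 2))

    return visual_mean, entropy


def min_max(x):
    """Rescale a 1-D series to [0, 1]; a constant series maps to zeros."""
    x = np.asarray(x, dtype=np.float64)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span


if __name__ == "__main__":
    # Synthetic attention: 4 layers, 2 heads, 6 queries, 10 keys.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 2, 6, 10))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    # Assume (for the demo only) that the first 4 key positions are visual.
    mask = np.zeros(10, dtype=bool)
    mask[:4] = True

    visual_mean, entropy = layer_attention_stats(attn, mask)
    print("normalized visual attention:", min_max(visual_mean))
    print("normalized entropy:", min_max(entropy))
```

With a real model, `attn` would come from the per-layer attention weights returned by the model, and `visual_mask` from the positions of the image tokens in the prompt. Both depend on the specific architecture and are left out of this sketch.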