Towards Multimodal In-Context Learning for Vision & Language Models

Sivan Doveh; Shaked Perek; M. Jehanzeb Mirza; Wei Lin; Amit Alfassy; Assaf Arbelle; Shimon Ullman; Leonid Karlinsky

TL;DR

This work tackles the limited in-context learning (ICL) capabilities of state-of-the-art vision-language models by introducing a simple, effective curriculum-based ICL instruction tuning framework built on top of Llava+. It uses multi-turn, any-shot conversations with semantically coherent ICL task mixes sourced from SEED and VL-Checklist data, enabling VLMs to leverage in-context demonstrations without sacrificing core zero-shot abilities. The authors provide new ICL benchmarks for VLMs and report substantial, consistent gains (around 11–13 percentage points on average) across fine-grained few-shot recognition and other ICL tasks, along with ablations that highlight the critical role of data mixing, semantic coherence, and replay. The results offer practical guidance for designing ICL-focused training curricula for multimodal models and open avenues for longer context and richer ICL task types in multimodal settings.

Abstract

State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities primarily by projecting the vision tokens from the encoder to language-like tokens, which are fed directly to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (e.g., image captioning and question answering), little emphasis has been put on transferring one of the core LLM capabilities: In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task from a few example demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed-modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) underperform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of present VLMs, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading to an ICL performance boost of up to 21.03% (11.3% on average) over the strongest VLM baselines across a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.

Paper Structure (36 sections, 2 equations, 16 figures, 10 tables)

Introduction
Related Work
Vision-Language Foundation Models:
In-context Learning for VLMs:
Method
Multi-turn ICL conversations
ICL instruction task types
Data sources for ICL instruction mixes
ICL benchmarks
Results
Evaluation Settings
Datasets:
Baselines:
Metrics:
Implementation Details:
...and 21 more sections

Figures (16)

Figure 1: Multiple data sources are used to generate multimodal ICL instructions that vary the type of ICL task and the type of semantic concept shared within each instruction, teaching the VLM to properly correlate information between in-context shots. Our insights on the best training data mix, along with our proposed "any-shot" training paradigm, enhance the VLM's ICL abilities.
Figure 2: Causal (left-only) attention, combined with formatting the ICL examples as consecutive conversation turns, results in "any-shot" training: the first turn's prediction is zero-shot, the next turn predicts its response given the context of the first, and so on, yielding a dynamic "any-shot" context. The grey shades illustrate the context that each turn's response attends to. As in Llava+, we use completion-only training, masking all but the desired responses (blue) in the target.
Figure 3: MME scores of baseline and our model.
Figure 4: Varying number of Shots on VL tasks.
Figure 5: Mean Accuracy (%) while scaling ICL instruction data.
...and 11 more figures
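
The "any-shot" scheme described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` class, the helper names, the toy token ids, and the `-100` ignore value (a common convention for labels excluded from the loss) are all assumptions made for illustration. The sketch concatenates ICL examples as consecutive conversation turns and builds completion-only labels, so that only response tokens are supervised; under causal attention, the response in turn k can then use the k earlier turns as in-context shots.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: -100 marks label positions that are excluded from the loss.
IGNORE_INDEX = -100


@dataclass
class Turn:
    """One conversation turn (hypothetical container, not from the paper)."""
    prompt: List[int]    # toy token ids standing in for image tokens + question
    response: List[int]  # toy token ids for the desired answer


def build_any_shot_example(turns: List[Turn]) -> Tuple[List[int], List[int]]:
    """Concatenate turns into one sequence with completion-only labels."""
    input_ids: List[int] = []
    labels: List[int] = []
    for turn in turns:
        # Prompt tokens are visible as context but contribute no loss.
        input_ids.extend(turn.prompt)
        labels.extend([IGNORE_INDEX] * len(turn.prompt))
        # Response tokens are the only supervised positions.
        input_ids.extend(turn.response)
        labels.extend(turn.response)
    return input_ids, labels


def shots_seen_per_turn(turns: List[Turn]) -> List[int]:
    """With causal attention, turn k's response sees k earlier complete turns."""
    return list(range(len(turns)))


turns = [
    Turn(prompt=[1, 2], response=[10]),      # predicted zero-shot
    Turn(prompt=[3, 4], response=[11, 12]),  # predicted with one shot of context
    Turn(prompt=[5], response=[13]),         # predicted with two shots of context
]
input_ids, labels = build_any_shot_example(turns)
print(input_ids)
print(labels)
print(shots_seen_per_turn(turns))
```

A single training sequence built this way supervises a zero-shot, a one-shot, and a two-shot prediction at once, which is the sense in which the context is "dynamic". Shifting labels for next-token prediction is left to the training loop and is not shown here.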
