What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, Wanxiang Che

TL;DR

The paper investigates what factors influence multi-modal in-context learning (MM-ICL) by systematically analyzing demonstration retrieval, ordering, and prompt construction. Through an extensive study across six vision-language models and twenty strategies over four tasks, it finds that multi-modal retrieval and intra-demonstration modality ordering are strong determinants of performance, while introductory instructions consistently improve understanding. It also shows that model size is less predictive than alignment quality, and that the MM-ICL context reduces the need for careful demonstration selection. The results offer practical guidelines for designing MM-ICL pipelines and highlight areas such as multi-modal alignment and prompt design for future research and deployment.

Abstract

Recent rapid advances in Multi-Modal In-Context Learning (MM-ICL) have achieved notable success, delivering superior performance across various tasks without requiring additional parameter tuning. However, the underlying rules governing the effectiveness of MM-ICL remain under-explored. To fill this gap, this work investigates the research question: "What factors affect the performance of MM-ICL?" To this end, we conduct extensive experiments on the three core steps of MM-ICL, namely demonstration retrieval, demonstration ordering, and prompt construction, using 6 vision large language models and 20 strategies. Our findings highlight (1) the necessity of a multi-modal retriever for demonstration retrieval, (2) the importance of intra-demonstration ordering over inter-demonstration ordering, and (3) the enhancement of task comprehension through introductory instructions in prompts. We hope this study can serve as a foundational guide for optimizing MM-ICL strategies in future research.
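
To make the first finding concrete, the following is a minimal sketch of what multi-modal demonstration retrieval could look like. It is an illustration under stated assumptions, not the paper's implementation: the embeddings are assumed to be precomputed by some encoder, the joint representation is a plain concatenation of image and text embeddings, and the names `cosine`, `retrieve_demonstrations`, `image_emb`, and `text_emb` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_demonstrations(query, pool, k=2):
    """Return the k pool items most similar to the query.

    Assumption: each item is a dict holding precomputed 'image_emb' and
    'text_emb' vectors. Concatenating them is one simple way to let both
    modalities influence the ranking; it is an illustrative choice only.
    """
    def joint(item):
        return list(item["image_emb"]) + list(item["text_emb"])

    q = joint(query)
    ranked = sorted(pool, key=lambda d: cosine(q, joint(d)), reverse=True)
    return ranked[:k]


# Toy pool with 2-dimensional embeddings per modality (made-up values).
pool = [
    {"id": "a", "image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},
    {"id": "b", "image_emb": [0.0, 1.0], "text_emb": [1.0, 0.0]},
    {"id": "c", "image_emb": [0.0, 1.0], "text_emb": [0.0, 1.0]},
]
query = {"image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]}
top = retrieve_demonstrations(query, pool, k=2)
print([d["id"] for d in top])
```

In this toy pool, a text-only retriever could not tell demonstrations `a` and `b` apart because their text embeddings are identical; the image embedding is what separates them in the joint ranking.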

Paper Structure

This paper contains 44 sections, 12 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: The whole process of prompt creation for multi-modal in-context learning.
  • Figure 2: The demonstration retrieval process for MM-ICL.
  • Figure 3: The demonstration ordering process for MM-ICL.
  • Figure 4: The process of instruction injection for MM-ICL prompt construction involves three key elements. The Introductory Instruction provides an overview of the task before the demonstrations. The Summative Instruction follows the demonstrations, guiding the model to apply the learned concepts to practical problems. The Intra-demonstration Instruction embeds task-specific guidance within each demonstration, enabling VLLMs to grasp task requirements during learning. Further details and additional prompts are provided in Appendix \ref{append:instruction}.
  • Figure 5: The impact of token pattern representation in Gemini-Pro.
  • ...and 11 more figures
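
The three instruction positions named in the Figure 4 caption can be sketched as a small prompt builder. This is a hedged illustration only: the function name `build_prompt`, the `<image:...>` placeholder, the `Question:`/`Answer:` layout, and the instruction strings are all hypothetical and are not taken from the paper's prompts.

```python
def build_prompt(demos, query, intro=None, intra=None, summative=None):
    """Assemble an MM-ICL style prompt with optional instruction injection.

    intro     -- introductory instruction, placed before all demonstrations
    intra     -- intra-demonstration instruction, repeated inside each demonstration
    summative -- summative instruction, placed after the demonstrations
    The '<image:...>' token is a stand-in for however a model marks images.
    """
    parts = []
    if intro:
        parts.append(intro)
    for d in demos:
        lines = []
        if intra:
            lines.append(intra)
        lines.append(f"<image:{d['image']}>")
        lines.append(f"Question: {d['question']}")
        lines.append(f"Answer: {d['answer']}")
        parts.append("\n".join(lines))
    if summative:
        parts.append(summative)
    # The query repeats the demonstration layout but leaves the answer open.
    parts.append(f"<image:{query['image']}>\nQuestion: {query['question']}\nAnswer:")
    return "\n\n".join(parts)


demos = [
    {"image": "d1.png", "question": "What animal is shown?", "answer": "A cat"},
    {"image": "d2.png", "question": "What animal is shown?", "answer": "A dog"},
]
query = {"image": "q.png", "question": "What animal is shown?"}
prompt = build_prompt(
    demos,
    query,
    intro="You will answer questions about images.",
    intra="Study this example.",
    summative="Now answer the final question in the same way.",
)
print(prompt)
```

Keeping each instruction as an optional argument makes it easy to switch one position on or off at a time, for example to compare an introductory-only prompt against a summative-only one.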