How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?

Yang Luo; Zangwei Zheng; Zirui Zhu; Yang You

TL;DR

This work introduces MSIER, a novel supervised prompt retriever for MLLMs. The retriever is trained using the MLLM's confidence to select in-context examples, which improves the efficiency of multimodal in-context learning. The work also investigates how modalities influence the training of the supervised retriever and explores the retriever's transferability.

Abstract

The increase in parameter size of multimodal large language models (MLLMs) introduces significant capabilities, particularly in-context learning, where MLLMs enhance task performance without updating pre-trained parameters. This effectiveness, however, hinges on the appropriate selection of in-context examples, a process that is currently biased towards visual data, overlooking textual information. Furthermore, the area of supervised retrievers for MLLMs, crucial for optimal in-context example selection, continues to be uninvestigated. Our study offers an in-depth evaluation of the impact of textual information on the unsupervised selection of in-context examples in multimodal contexts, uncovering a notable sensitivity of retriever performance to the employed modalities. Responding to this, we introduce a novel supervised MLLM-retriever MSIER that employs a neural network to select examples that enhance multimodal in-context learning efficiency. This approach is validated through extensive testing across three distinct tasks, demonstrating the method's effectiveness. Additionally, we investigate the influence of modalities on our supervised retrieval method's training and pinpoint factors contributing to our model's success. This exploration paves the way for future advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.
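The unsupervised selection discussed in the abstract can be illustrated with a minimal sketch. It assumes each example already has an image embedding and a text embedding, and that the two cosine similarities are mixed with a hypothetical `text_weight` parameter. The abstract does not specify the encoder or the fusion rule, so the names `cosine`, `retrieve`, and `text_weight`, and the toy vectors, are illustrative assumptions rather than the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, pool, k=2, text_weight=0.5):
    """Return indices of the k pool items most similar to the query.

    text_weight=0.0 corresponds to image-only retrieval; larger values
    mix in text similarity (an assumed fusion rule, for illustration only).
    """
    scored = []
    for i, item in enumerate(pool):
        s_img = cosine(query["image"], item["image"])
        s_txt = cosine(query["text"], item["text"])
        scored.append(((1 - text_weight) * s_img + text_weight * s_txt, i))
    # Highest score first; ties broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy embeddings (hypothetical values): item 0 matches the query image exactly
# but has unrelated text; item 1 is close on the image and matches the text.
pool = [
    {"image": [1.0, 0.0], "text": [0.0, 1.0]},
    {"image": [0.9, 0.1], "text": [1.0, 0.0]},
    {"image": [0.0, 1.0], "text": [1.0, 0.0]},
]
query = {"image": [1.0, 0.0], "text": [1.0, 0.0]}

image_only = retrieve(query, pool, k=1, text_weight=0.0)
image_text = retrieve(query, pool, k=1, text_weight=0.5)
```

In this toy setup the retrieved example changes once text similarity is included, which is the kind of modality sensitivity the abstract describes.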

Paper Structure (22 sections, 2 equations, 6 figures, 10 tables)

Introduction
Related Work
Method
Multimodal In-Context Learning
Importance of text information for unsupervised MLLM retrieval
Multimodal Supervised Prompt Retriever
Importance of Textual Information in Supervised Retriever
Experiment
Datasets
Compared Methods
Main Results
Further Analysis
Ablation Study
Conclusion
Limitations
...and 7 more sections

Figures (6)

Figure 1: An overview of multimodal in-context example retrieval: The process receives an image or an image-text query from the test dataset and uses a retrieval mechanism to find similar examples in a training dataset. These examples and the original query (collectively called the prompt) are then fed into an MLLM to generate the output.
Figure 2: Overview of the MSIER method: The core idea is to assess the in-context learning performance of each source instance and then select the instances with the most favorable and least favorable outcomes. The selected instances form a dataset of positive and negative samples for contrastive learning. Examples with high CIDEr scores (corresponding to low NLL loss during the scoring process) are selected as positive samples.
Figure 3: The introduction of textual information in the unsupervised method leads to a higher M-ICL performance across all numbers of shots, demonstrating the importance of text modality.
Figure 4: Impact of texts on proposed MSIER method. 'T' denotes the Training setting and 'E' denotes the Evaluation setting.
Figure 5: Impact of the order of retrieved multimodal in-context examples.
...and 1 more figure
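The positive/negative selection described in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes each candidate already has a scalar score where higher is better (for example, a CIDEr-style score). The function name `build_contrastive_sets`, the parameters `num_pos` and `num_neg`, and the example scores are hypothetical and not taken from the paper.

```python
def build_contrastive_sets(candidates, scores, num_pos=1, num_neg=1):
    """Split scored candidates into positive and negative samples.

    Candidates with the highest scores become positives and those with
    the lowest scores become negatives. Scores are assumed to be
    "higher is better"; the scoring model itself is out of scope here.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    if num_pos + num_neg > len(candidates):
        raise ValueError("not enough candidates for the requested split")

    # Rank from best to worst score.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    positives = [c for c, _ in ranked[:num_pos]]
    negatives = [c for c, _ in ranked[len(ranked) - num_neg:]]
    return positives, negatives


# Hypothetical candidate IDs and scores for a single source instance.
candidates = ["ex_a", "ex_b", "ex_c", "ex_d"]
scores = [0.2, 0.9, 0.5, 0.1]

positives, negatives = build_contrastive_sets(candidates, scores)
```

The resulting positive and negative sets are what a contrastive objective would then consume when training the retriever.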
