
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?

Yang Luo, Zangwei Zheng, Zirui Zhu, Yang You

TL;DR

MSIER is a novel supervised prompt retriever for MLLMs. It trains a retriever on the MLLM's confidence to select in-context examples, which improves multimodal in-context learning efficiency. The work also investigates how modalities influence the training of the supervised retriever and explores the retriever's transferability.

Abstract

The increase in parameter size of multimodal large language models (MLLMs) introduces significant capabilities, particularly in-context learning, where MLLMs enhance task performance without updating pre-trained parameters. This effectiveness, however, hinges on the appropriate selection of in-context examples, a process that is currently biased towards visual data and overlooks textual information. Furthermore, supervised retrievers for MLLMs, which are crucial for optimal in-context example selection, remain unexplored. Our study offers an in-depth evaluation of the impact of textual information on the unsupervised selection of in-context examples in multimodal contexts, uncovering a notable sensitivity of retriever performance to the modalities employed. In response, we introduce MSIER, a novel supervised MLLM retriever that employs a neural network to select examples that enhance multimodal in-context learning efficiency. The approach is validated through extensive testing across three distinct tasks, demonstrating its effectiveness. Additionally, we investigate the influence of modalities on the training of our supervised retrieval method and pinpoint factors contributing to our model's success. This exploration paves the way for future advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data.
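To make the role of the text modality in unsupervised retrieval concrete, the following is a minimal sketch, not the authors' implementation. It assumes each example already has an image embedding and a text embedding (the `image` and `text` dictionary keys, the `retrieve` function, and the `text_weight` mixing parameter are all hypothetical), and it ranks candidates by a weighted sum of cosine similarities. Setting `text_weight=0.0` gives image-only retrieval.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, pool, k=2, text_weight=0.5):
    """Return indices of the k pool items most similar to the query.

    `query` and every item in `pool` are dicts with hypothetical keys
    'image' and 'text' holding precomputed embedding vectors.
    The weighted-sum combination is an assumption made for illustration.
    """
    scored = []
    for index, item in enumerate(pool):
        sim_image = cosine(query["image"], item["image"])
        sim_text = cosine(query["text"], item["text"])
        score = (1.0 - text_weight) * sim_image + text_weight * sim_text
        scored.append((score, index))
    # Highest score first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


# Toy data: items 0 and 1 share the query's image embedding,
# but only item 1 also matches the query's text embedding.
pool = [
    {"image": [1.0, 0.0], "text": [0.0, 1.0]},
    {"image": [1.0, 0.0], "text": [1.0, 0.0]},
    {"image": [0.0, 1.0], "text": [1.0, 0.0]},
]
query = {"image": [1.0, 0.0], "text": [1.0, 0.0]}

image_only = retrieve(query, pool, k=1, text_weight=0.0)
image_and_text = retrieve(query, pool, k=1, text_weight=0.5)
```

In this toy pool, image-only retrieval cannot distinguish items 0 and 1, while adding the text similarity moves item 1 to the top. This mirrors, in miniature, the sensitivity to modalities that the abstract describes.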

Paper Structure (22 sections, 2 equations, 6 figures, 10 tables)


Figures (6)

  • Figure 1: An overview of multimodal in-context example retrieval. An image or image-text query from the test dataset is passed to a retrieval mechanism, which finds similar examples in a training dataset. These examples and the original query (collectively called the prompt) are then fed into an MLLM to generate the output.
  • Figure 2: Overview of the MSIER method. The in-context learning performance of each source instance is assessed, and the instances with the best and worst outcomes are selected. The selected instances form a dataset of positive and negative samples, which is used for contrastive learning. The examples with high CIDEr scores (corresponding to low NLL loss during the scoring process) are selected as positive samples.
  • Figure 3: The introduction of textual information in the unsupervised method leads to a higher M-ICL performance across all numbers of shots, demonstrating the importance of text modality.
  • Figure 4: Impact of texts on proposed MSIER method. 'T' denotes the Training setting and 'E' denotes the Evaluation setting.
  • Figure 5: Impact of the order of retrieved multimodal in-context examples.
  • ...and 1 more figure
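
The selection step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSIER implementation: `split_candidates`, `info_nce`, `n_pos`, `n_neg`, and `temperature` are hypothetical names, the score is assumed to be "higher is better" (for example a CIDEr score, or a negated NLL), and the InfoNCE-style loss is one common contrastive objective that the source does not confirm.

```python
import math


def split_candidates(candidates, score_fn, n_pos=1, n_neg=1):
    """Pick the best-scoring candidates as positives and the worst as negatives.

    `score_fn` maps a candidate to a number where higher means better
    in-context learning performance (an assumption for this sketch).
    """
    if n_pos + n_neg > len(candidates):
        raise ValueError("n_pos + n_neg must not exceed the number of candidates")
    ranked = sorted(candidates, key=score_fn, reverse=True)
    positives = ranked[:n_pos]
    negatives = ranked[len(ranked) - n_neg:] if n_neg else []
    return positives, negatives


def info_nce(pos_sim, neg_sims, temperature=1.0):
    """InfoNCE-style loss for one positive similarity and several negatives.

    Computes -log(exp(p/t) / (exp(p/t) + sum_j exp(n_j/t))) in a
    numerically stable way. Assumed objective, not taken from the source.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_denominator - logits[0]


# Toy scores standing in for per-candidate in-context learning performance.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1}
positives, negatives = split_candidates(list(toy_scores), toy_scores.get, n_pos=1, n_neg=1)

# The loss shrinks as the positive is ranked further above the negatives.
loss_tied = info_nce(0.0, [0.0])
loss_separated = info_nce(10.0, [0.0])
```

Here candidate `b` becomes the positive sample and `d` the negative one. A retriever trained with such a loss is pushed to score the query closer to examples that helped the MLLM than to examples that hurt it.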