DRUM: Learning Demonstration Retriever for Large MUlti-modal Models

Ellen Yi-Ge, Jiechao Gao, Wei Han, Wei Zhu

TL;DR

This work tackles the challenge of suboptimal demonstration retrieval for multimodal in-context learning in large vision-language models. It introduces DRUM, a framework that fine-tunes a visual-language embedding using LVLM feedback to rank and retrieve demonstrations via a unified list-wise loss, supplemented by iterative candidate mining. The approach yields significant improvements over baselines across VQA, image classification, and image captioning tasks and demonstrates transferability to commercial LVLMs. By aligning the retriever with the downstream LVLM’s preferences, DRUM enhances context formation and task adaptation in multimodal ICL with practical implications for real-world LVLM deployment. The combination of joint image–prompt–draft representations and LVLM-driven ranking provides a scalable path to more effective demonstrations in diverse multimodal settings.
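To make the idea of a list-wise loss driven by LVLM feedback concrete, here is a minimal sketch. The summary does not state the exact form of the loss, so this example assumes a ListNet-style objective: the LVLM's feedback scores for a list of candidate demonstrations are turned into a target distribution, and the retriever's similarity scores are trained to match it via cross-entropy. The function names, the `temperature` parameter, and the toy scores are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(retriever_scores, lvlm_feedback, temperature=1.0):
    """ListNet-style loss (an assumption, not necessarily the paper's loss).

    retriever_scores: similarity of each candidate to the query, as
        produced by the embedding model being fine-tuned.
    lvlm_feedback: how helpful the LVLM found each candidate, e.g. the
        log-likelihood of the gold answer when that candidate is used
        as the demonstration (hypothetical choice of feedback signal).
    """
    target = softmax(lvlm_feedback, temperature)
    predicted = softmax(retriever_scores)
    # Cross-entropy between the LVLM-induced ranking distribution and
    # the distribution induced by the retriever's scores.
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# Toy example: three candidates, the LVLM prefers the first one.
feedback = [2.0, 1.0, 0.0]
aligned = listwise_loss([2.0, 1.0, 0.0], feedback)
reversed_order = listwise_loss([0.0, 1.0, 2.0], feedback)
```

In this toy case the loss is lower when the retriever's ordering agrees with the LVLM's preferences than when it is reversed, which is the behavior a retriever fine-tuned on such feedback would be pushed toward.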

Abstract

Recently, large language models (LLMs) have demonstrated impressive capabilities in dealing with new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt naive strategies such as using fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLM. To address this issue, we propose a novel framework, Demonstration Retriever for Large MUlti-modal Models (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM's needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given, and we propose to concatenate the image and text embeddings to enhance the retrieval performance. Second, we propose to re-rank the demonstrations retrieved by the embedding model using the LVLM's feedback, and to calculate a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on three types of visual-language tasks and seven benchmark datasets, our DRUM framework proves effective in boosting the LVLM's in-context learning performance by retrieving more suitable demonstrations.
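The first step of the abstract, retrieval with concatenated image and text embeddings, can be sketched as follows. This is an illustrative example under stated assumptions rather than the authors' implementation: the embeddings are assumed to be precomputed by some visual-language encoder, each modality is L2-normalized before concatenation (a choice made here, not one stated in the abstract), and the helper names `joint_embedding` and `retrieve` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def joint_embedding(image_emb, text_emb):
    """Concatenate the per-modality embeddings into one retrieval key.

    Normalizing each part first (an assumption of this sketch) keeps one
    modality from dominating the dot product.
    """
    return l2_normalize(image_emb) + l2_normalize(text_emb)


def retrieve(query, pool, k):
    """Return indices of the k pool items most similar to the query.

    query: (image_emb, text_emb) pair for the test sample.
    pool: list of (image_emb, text_emb) pairs for candidate demonstrations.
    """
    q = joint_embedding(*query)
    scored = []
    for idx, (img, txt) in enumerate(pool):
        d = joint_embedding(img, txt)
        score = sum(a * b for a, b in zip(q, d))  # dot product
        scored.append((score, idx))
    # Highest score first; ties broken by pool order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-d embeddings standing in for real encoder outputs.
query = ([1.0, 0.0], [0.0, 1.0])
pool = [
    ([1.0, 0.0], [0.0, 1.0]),  # matches image and text
    ([0.0, 1.0], [1.0, 0.0]),  # matches neither
    ([1.0, 0.0], [1.0, 0.0]),  # matches image only
]
top2 = retrieve(query, pool, k=2)
```

With this toy pool, the candidate that matches both modalities ranks first and the image-only match second, showing how the concatenated key rewards agreement in both the image and the text.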

Paper Structure

This paper contains 16 sections, 11 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: The schematic representation of our DRUM framework. Circles, rectangles, and triangles respectively represent the images, prompts, and responses in the triplet.
  • Figure 2: The effects of the number of demonstrations on DRUM, EPR, and CLIP.