Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, Sungroh Yoon

TL;DR

This work tackles interactive text-to-image retrieval by exposing the limitations of using raw dialogues with zero-shot retrievers. It introduces PlugIR, a plug-and-play framework with two modules: context reformulation, which converts dialogue context into caption-like input for pre-trained vision-language retrievers, and context-aware dialogue generation, which grounds LLM questions in retrieval candidates via retrieval-context extraction and filtering. To fairly evaluate multi-turn retrieval, it proposes Best log Rank Integral (BRI), a K-agnostic metric that jointly captures user satisfaction, efficiency, and ranking improvements, and demonstrates its strong alignment with human judgments. Across VisDial, COCO, and Flickr30k, PlugIR outperforms zero-shot and fine-tuned baselines and remains effective across different retrievers, illustrating practical plug-and-play applicability and robustness to perturbations and model choices.

Abstract

In this paper, we primarily address the problem of dialogue-form context queries in the interactive text-to-image retrieval task. Our method, PlugIR, leverages the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the need to fine-tune a retrieval model on existing visual dialogue data, thereby enabling the use of any black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on information about the retrieval candidate images in the current context. This approach mitigates noise and redundancy in the generated questions. Beyond our method, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of interactive retrieval systems. PlugIR outperforms both zero-shot and fine-tuned baselines on various benchmarks. Additionally, the two components of PlugIR can be applied together or separately, depending on the situation. Our code is available at https://github.com/Saehyung-Lee/PlugIR.
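
The plug-and-play structure described in the abstract can be pictured as a loop in which every model is an interchangeable callable. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `format_dialogue`, `interactive_retrieval`, `reformulate`, `retrieve`, `ask`, and `answer` are hypothetical, and the prompts, retrieval-context extraction, and question filtering used by PlugIR are not reproduced here.

```python
from typing import Callable, List, Tuple

Dialogue = List[Tuple[str, str]]  # (question, answer) pairs collected so far


def format_dialogue(caption: str, dialogue: Dialogue) -> str:
    """Flatten the initial caption and the Q/A pairs into plain text."""
    lines = [f"Caption: {caption}"]
    for question, answer in dialogue:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    return "\n".join(lines)


def interactive_retrieval(
    caption: str,
    reformulate: Callable[[str], str],            # hypothetical LLM call: dialogue text -> caption-like query
    retrieve: Callable[[str, int], List[str]],    # hypothetical black-box retriever: (query, k) -> image ids
    ask: Callable[[str, List[str], List[str]], str],  # hypothetical questioner: (context, candidates, asked) -> question
    answer: Callable[[str], str],                 # the user (or a simulator) answering a question
    rounds: int,
    top_k: int = 10,
) -> List[List[str]]:
    """Run round 0 (caption only) plus `rounds` question-answer rounds.

    Returns the retrieved candidates of every round, round 0 first.
    """
    dialogue: Dialogue = []
    history: List[List[str]] = []
    for round_index in range(rounds + 1):
        # Context reformulation: the retriever never sees the raw dialogue.
        query = reformulate(format_dialogue(caption, dialogue))
        candidates = retrieve(query, top_k)
        history.append(candidates)
        if round_index == rounds:
            break
        # The questioner is conditioned on the current candidates and on the
        # questions already asked, so it can avoid repeating itself.
        asked = [question for question, _ in dialogue]
        question = ask(format_dialogue(caption, dialogue), candidates, asked)
        dialogue.append((question, answer(question)))
    return history
```

Because the retriever is only ever called through `retrieve(query, top_k)`, swapping one vision-language model for another would not require touching the rest of the loop, which is the property the abstract refers to as using a black-box model.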

Paper Structure

This paper contains 42 sections, 2 equations, 9 figures, 20 tables, and 2 algorithms.

Figures (9)

  • Figure 1: The main framework of the plug-and-play interactive text-to-image retrieval system.
  • Figure 2: Round-by-round text-to-image retrieval performances of CLIP, BLIP, BLIP-2, and the Amazon Titan multimodal foundation model (ATM). In the 0th round, an image caption is provided as the query, and with each subsequent round, a single question-answer pair is added. Solid lines represent Recall@10, while dotted lines indicate Hits@10.
  • Figure 3: Hits@10 comparisons of our proposed method with ZS and FT on VisDial, COCO, and Flickr30k.
  • Figure 4: Round-by-round text-to-image retrieval performances of Model fine-tuning and context reformulation. Solid lines represent Recall@10, while dotted lines indicate Hits@10.
  • Figure 5: Round-by-round text-to-image retrieval performances in the ablation study. Solid lines represent Recall@10, while dotted lines indicate Hits@10.
  • ...and 4 more figures
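
Several captions above contrast Recall@10 with Hits@10. This excerpt does not define either metric, so the sketch below assumes the usual reading in interactive retrieval: Recall@K is computed independently at each round, while Hits@K is cumulative and counts a query as successful once its target has appeared in the top K at the current round or any earlier one. The function names and the sample ranks are illustrative only.

```python
from typing import List


def recall_at_k(ranks_per_round: List[List[int]], k: int) -> List[float]:
    """Per-round Recall@K.

    ranks_per_round[r][i] is the 1-based rank of query i's target image
    at dialogue round r.
    """
    return [
        sum(rank <= k for rank in ranks) / len(ranks)
        for ranks in ranks_per_round
    ]


def hits_at_k(ranks_per_round: List[List[int]], k: int) -> List[float]:
    """Cumulative Hits@K: a query stays a hit once it has been in the top K."""
    num_queries = len(ranks_per_round[0])
    already_hit = [False] * num_queries
    scores = []
    for ranks in ranks_per_round:
        already_hit = [
            hit or rank <= k for hit, rank in zip(already_hit, ranks)
        ]
        scores.append(sum(already_hit) / num_queries)
    return scores


# Two queries tracked over three rounds (illustrative ranks, not paper data).
ranks = [[5, 20], [15, 8], [12, 30]]
print(recall_at_k(ranks, k=10))  # [0.5, 0.5, 0.0]
print(hits_at_k(ranks, k=10))    # [0.5, 1.0, 1.0]
```

Under these assumed definitions Hits@K can never decrease from one round to the next, whereas Recall@K can drop when an added question-answer pair pushes the target out of the top K.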

Theorems & Definitions (2)

  • Definition 1
  • Definition 2