Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas

TL;DR

This work addresses the limitations of single-turn image retrieval by introducing a multi-turn, interactive system that refines text queries using user relevance feedback. It combines a vision-language model that generates image captions with an LLM-based denoiser that cleans the expanded queries, so that the natural-language queries become progressively more informative across turns. A new dataset adapted from MSR-VTT, with multiple ground-truth images per query, supports evaluation, and experiments show a state-of-the-art recall gain of roughly 10% over baselines after 6 turns. The approach demonstrates the practical value of integrating VLMs and LLMs for interactive, cross-modal retrieval with improved robustness to caption noise.

Abstract

Image search is a pivotal task in multimedia and computer vision, with applications across diverse domains ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, which can introduce inaccuracies and limit recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
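
The multi-turn loop described in the abstract (retrieve, collect relevance feedback, expand the query with captions of relevant images, then denoise the expanded query) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the token-overlap scorer stands in for a learned image retriever, `captioner` stands in for the VLM captioner, `toy_denoise` stands in for the LLM denoiser, and the names `interactive_search`, `INDEX`, and `RELEVANT` are hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stopword list used only by the toy denoiser below.
STOPWORDS = {"a", "on", "the", "with", "for"}


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / max(len(q_tokens), 1)


def retrieve(query: str, index: Dict[str, str], seen: Set[str], k: int) -> List[str]:
    """Return the top-k unseen image ids, ties broken by id for determinism."""
    candidates = [img for img in index if img not in seen]
    candidates.sort(key=lambda img: (-overlap_score(query, index[img]), img))
    return candidates[:k]


def toy_denoise(query: str, max_tokens: int = 12) -> str:
    """Stand-in for the LLM denoiser: drop stopwords and repeats, cap the length."""
    kept: List[str] = []
    for tok in query.lower().split():
        if tok not in STOPWORDS and tok not in kept:
            kept.append(tok)
    return " ".join(kept[:max_tokens])


def interactive_search(
    query: str,
    index: Dict[str, str],
    captioner: Callable[[str], str],
    is_relevant: Callable[[str], bool],
    denoise: Callable[[str], str] = toy_denoise,
    turns: int = 2,
    k: int = 2,
) -> List[str]:
    """Run a multi-turn search and return the relevant images found, in order."""
    seen: Set[str] = set()
    found: List[str] = []
    for _ in range(turns):
        results = retrieve(query, index, seen, k)
        if not results:
            break
        seen.update(results)
        # Relevance feedback: the (simulated) user marks relevant results.
        relevant = [img for img in results if is_relevant(img)]
        found.extend(relevant)
        # Query expansion: append captions of the images marked relevant.
        expansion = " ".join(captioner(img) for img in relevant)
        # Query refinement: clean the expanded query before the next turn.
        query = denoise(f"{query} {expansion}".strip())
    return found


# Toy data: text descriptions stand in for image content (an assumption).
INDEX = {
    "img1": "a man plays guitar on stage",
    "img2": "a man sings on stage with a band",
    "img3": "a band performs rock music for a crowd",
    "img4": "a cat sleeps on a sofa",
    "img5": "a dog runs on the beach",
}
RELEVANT = {"img1", "img2", "img3"}

found = interactive_search(
    "man guitar", INDEX, INDEX.__getitem__, RELEVANT.__contains__
)
```

In this toy run the initial query matches only the first two images; the captions of those images add the token "band" to the query, which lets the second turn reach a third relevant image. In a real system the scorer would be replaced by a cross-modal retriever and the two stand-ins by actual VLM and LLM calls.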

Paper Structure (18 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Flowchart of the proposed multi-turn interactive image retrieval approach based on relevance feedback. This diagram illustrates an example of the initial interaction followed by query expansion and query refinement.
  • Figure 2: Accumulated recall of the methods as a function of the number of interaction turns. The line representing the Rocchio method levels off after the second turn, indicating that no further relevant images are retrieved beyond this point. In contrast, the other two methods consistently retrieve new relevant images across successive turns, so their accumulated recall keeps increasing.
  • Figure 3: Performance of systems using two LLM variants and four different sizes of CoT query summaries. The CLIP model is used as the image retriever in this setup.
  • Figure 4: Example queries and search results from the proposed system with CoT query summaries. Irrelevant images are marked with a red cross.
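
The accumulated recall plotted in Figure 2 can be illustrated with a short sketch. The definition used here is an assumption, since the paper's exact formula is not reproduced on this page: the fraction of a query's ground-truth images retrieved in any turn up to and including the current one. The function name `accumulated_recall` and the toy data are hypothetical.

```python
from typing import List, Sequence, Set


def accumulated_recall(
    results_per_turn: Sequence[Sequence[str]], ground_truth: Set[str]
) -> List[float]:
    """Return the recall accumulated after each turn for a single query."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one image id")
    found: Set[str] = set()
    curve: List[float] = []
    for results in results_per_turn:
        # Count each ground-truth image once, no matter how often it is returned.
        found.update(img for img in results if img in ground_truth)
        curve.append(len(found) / len(ground_truth))
    return curve


# Toy example: new relevant images appear in turns 1 and 2, none in turn 3.
turns = [["a", "x"], ["b", "y"], ["z", "w"]]
curve = accumulated_recall(turns, {"a", "b", "c", "d"})
# curve == [0.25, 0.5, 0.5]
```

A curve that stops rising, as in the last turn of this toy example, corresponds to the levelling-off described for the Rocchio method: later turns return no ground-truth images that were not already found.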