
ReCQR: Incorporating conversational query rewriting to improve Multimodal Image Retrieval

Yuan Hu, ZhiYu Cao, PeiFeng Li, QiaoMing Zhu

Abstract

With the rise of multimodal learning, image retrieval plays a crucial role in connecting visual information with natural language queries. Existing image retrievers struggle with processing long texts and handling unclear user expressions. To address these issues, we introduce the conversational query rewriting (CQR) task into the image retrieval domain and construct a dedicated multi-turn dialogue query rewriting dataset. Built on full dialogue histories, CQR rewrites users' final queries into concise, semantically complete ones that are better suited for retrieval. Specifically, we first leverage Large Language Models (LLMs) to generate rewritten candidates at scale and employ an LLM-as-Judge mechanism combined with manual review to curate approximately 7,000 high-quality multimodal dialogues, forming the ReCQR dataset. We then benchmark several state-of-the-art multimodal models on ReCQR to assess their image retrieval performance. Experimental results demonstrate that CQR not only significantly enhances the accuracy of traditional image retrieval models, but also provides new directions and insights for modeling user queries in multimodal systems.
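
As an illustration of the rewriting step described in the abstract, the sketch below shows one way an LLM could rewrite a user's final query given the full dialogue history. The model choice, prompt wording, and the `rewrite_query` helper are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of LLM-based conversational query rewriting (CQR).
# Assumption: the OpenAI Python client is available; the model name, prompt,
# and helper below are illustrative, not the paper's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rewrite_query(dialogue_history: list[str], final_query: str) -> str:
    """Rewrite the user's final query into a concise, self-contained
    retrieval query, using the full dialogue history as context."""
    history = "\n".join(dialogue_history)
    prompt = (
        "Given the dialogue history below, rewrite the user's final query so "
        "that it is concise, semantically complete, and usable as a standalone "
        "image-retrieval query. Resolve all ambiguous references.\n\n"
        f"Dialogue history:\n{history}\n\n"
        f"Final query: {final_query}\n\n"
        "Rewritten query:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()


# Example usage:
# rewrite_query(
#     ["User: Show me photos of a dog surfing.",
#      "Assistant: Here are some surfing dogs.",
#      "User: I liked the one at sunset."],
#     "Can you find more of that scene?",
# )
```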

Paper Structure

This paper contains 12 sections, 6 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: The necessity of conversational query rewriting. The user's Original Query contains an ambiguous reference ("that scene"), which leads to incorrect retrieval results. The Rewritten Query disambiguates the request by incorporating key visual context from the dialogue history, enabling the retriever to return the correct image.
  • Figure 2: The two-stage dataset construction pipeline. Stage One creates dialogues for a single image $I_1$, generating a standard caption $C_1$, a target query $Tq_1$, a dialogue history $D_1$ and a contextually abbreviated original query $Oq_1$. Stage Two creates dialogues for a semantically related image pair ($I_1$, $I_2$), generating captions ($C_1$, $C_2$), a dialogue $D_2$ that bridges both images, and a final original query $Oq_2$ that resolves to target query $Tq_2$.
  • Figure 3: Semantic relevance validation for image pairing. Candidate images $I_1$ and $I_2$ are processed by BLIP to generate captions. Key entities are extracted from these captions (using spaCy) and their relationship is checked in the ConceptNet knowledge graph (a minimal sketch of this check follows the list).
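
For concreteness, the following is a minimal sketch of the kind of semantic relevance check Figure 3 describes, assuming BLIP captioning via Hugging Face transformers, spaCy noun-chunk extraction, and ConceptNet's public relatedness endpoint. The 0.2 threshold and the helper names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the image-pairing check in Figure 3.
# Assumptions: BLIP captioning (Hugging Face transformers), spaCy noun chunks
# as "key entities", and ConceptNet's public /relatedness endpoint; the
# threshold and helper names are illustrative, not the paper's implementation.
import requests
import spacy
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
nlp = spacy.load("en_core_web_sm")


def caption(image_path: str) -> str:
    """Generate a BLIP caption for a single image."""
    inputs = processor(Image.open(image_path).convert("RGB"), return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)


def key_entities(text: str) -> list[str]:
    """Extract the head nouns of noun chunks as key entities."""
    return [chunk.root.lemma_.lower() for chunk in nlp(text).noun_chunks]


def conceptnet_relatedness(a: str, b: str) -> float:
    """Query ConceptNet's relatedness score between two English terms."""
    resp = requests.get(
        "http://api.conceptnet.io/relatedness",
        params={"node1": f"/c/en/{a}", "node2": f"/c/en/{b}"},
    )
    return resp.json().get("value", 0.0)


def semantically_related(img1: str, img2: str, threshold: float = 0.2) -> bool:
    """Accept an image pair if any entity pair drawn from the two captions
    exceeds the relatedness threshold in ConceptNet."""
    ents1, ents2 = key_entities(caption(img1)), key_entities(caption(img2))
    return any(
        conceptnet_relatedness(a, b) >= threshold for a in ents1 for b in ents2
    )
```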