VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

Di Wu, Yixin Wan, Kai-Wei Chang

TL;DR

In cross-modal text-to-image retrieval, embeddings often behave as bags of concepts and fail to capture structured visual relations such as pose and viewpoint. VisRet addresses this by first visualizing the textual query through T2I generation and then retrieving entirely within the image modality, using Reciprocal Rank Fusion to combine the rankings from multiple visual queries. The approach yields consistent, state-of-the-art gains across four benchmarks (Visual-RAG, INQUIRE-Rerank, COCO, Visual-RAG-ME) and boosts downstream VQA performance in retrieval-augmented generation, while remaining compatible with various T2I models and LVLMs. The pipeline is modular and training-free, offering a practical path for knowledge-intensive vision-language retrieval; it will be released together with the Visual-RAG-ME benchmark for broader evaluation.
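The TL;DR names Reciprocal Rank Fusion (RRF) as the step that merges the rankings obtained from several generated query images. The snippet below is a minimal sketch of generic RRF, not the authors' implementation: the function name, the example image IDs, and the smoothing constant k = 60 (a common default in the RRF literature) are assumptions, since this page does not state the paper's settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of image IDs into one ranking.

    rankings: list of lists, each ordered from best to worst match.
    k: smoothing constant (assumed value; not taken from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item contributes 1 / (k + 1).
        for rank, image_id in enumerate(ranking, start=1):
            scores[image_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda image_id: (-scores[image_id], image_id))


# Hypothetical rankings, one per generated query image.
per_query_rankings = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_d"],
    ["img_b", "img_c", "img_a"],
]
fused = reciprocal_rank_fusion(per_query_rankings)
```

Because RRF uses only rank positions, it does not require the similarity scores from different query images to be on a comparable scale.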

Abstract

Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts and underrepresent structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a practical and principled path that energizes further advances in vision-language retrieval. Our code and the Visual-RAG-ME benchmark will be publicly released.
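The retrieval gains in the abstract are reported in nDCG@30. For readers unfamiliar with the metric, here is a minimal sketch of one common formulation (linear gain, log2 discount). The function names are illustrative, the exact variant used in the paper is not stated on this page, and the sketch assumes the relevance list covers the whole candidate pool so that the ideal ordering can be derived from it.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=30):
    """nDCG@k for one query.

    ranked_relevances: relevance labels in the order the retriever ranked
    the candidates (assumed to include every candidate for the query).
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidates for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical binary labels: the only relevant image is ranked second.
score = ndcg_at_k([0, 1, 0, 0], k=30)
```

Under this formulation, an absolute gain of 0.125 means the normalized score itself rises by 0.125 on a 0-to-1 scale, averaged over queries.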

Paper Structure

This paper contains 41 sections, 2 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: An overview of VisRet. Compared to the traditional T2I retrieval pipeline, VisRet first projects the text query into the image modality via T2I generation and then performs within-modality retrieval.
  • Figure 2: Downstream RAG-based VQA accuracy on Visual-RAG and Visual-RAG-ME with CLIP as the retriever and GPT-4o as the reader LVLM.
  • Figure 3: Prompt for instructing an LLM to generate the T2I generation instruction for Visual-RAG questions.
  • Figure 4: Prompt for instructing an LLM to generate the T2I generation instruction for Visual-RAG-ME questions.
  • Figure 5: Prompt for instructing an LLM to generate the T2I generation instruction for INQUIRE-Rerank-Hard and COCO-Hard questions.
  • ...and 3 more figures