A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

Bulat Khaertdinov, Mirela Popa, Nava Tintarev

TL;DR

This work improves text-to-image retrieval with vision-language models at test time using relevance feedback. It introduces four strategies built on a Rocchio-style embedding refinement: PRF, GRF, an Attentive Feedback Summarizer (AFS), and explicit feedback. GRF uses synthetic captions generated by LLaVA, and AFS uses a compact two-block transformer to aggregate fine-grained feedback signals. Across Flickr30K and COCO with multiple backbones, GRF, AFS, and explicit feedback yield consistent gains over retrieval without feedback (approximately 3–5% in MRR@5 for smaller VLMs and 1–3% for larger ones); AFS is robust to query drift and often approaches explicit feedback performance. The results point to a practical, model-agnostic approach to interactive visual search that can be layered on top of existing VLMs to improve retrieval without expensive fine-tuning.

Abstract

Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30K and COCO with several VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3–5% in MRR@5 for smaller VLMs, and 1–3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
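
To make the idea of refining a query embedding from top-ranked results more concrete, here is a minimal sketch of pseudo-relevance feedback with a classical, positive-only Rocchio update, q' = normalize(alpha * q + beta * mean(feedback)). It is an illustration under stated assumptions, not the authors' implementation: the function names, the weights `alpha` and `beta`, and the use of random vectors in place of real VLM embeddings are all hypothetical, and the extended Rocchio variant evaluated in the paper is not reproduced here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query."""
    scores = gallery @ query
    return np.argsort(-scores)[:k]


def rocchio_refine(query, feedback, alpha=1.0, beta=0.5):
    """Classical positive-only Rocchio update (alpha, beta are assumed weights).

    The refined query is a weighted sum of the original query embedding and
    the centroid of the feedback embeddings, re-normalized to unit length.
    """
    feedback = np.atleast_2d(feedback)
    refined = alpha * query + beta * feedback.mean(axis=0)
    return l2_normalize(refined)


def pseudo_relevance_feedback(query, gallery, k=5, alpha=1.0, beta=0.5):
    """Treat the top-k retrieved items as relevant and refine the query."""
    top_k = retrieve(query, gallery, k)
    return rocchio_refine(query, gallery[top_k], alpha, beta)


# Random unit vectors stand in for image and text embeddings from a VLM.
rng = np.random.default_rng(0)
gallery = l2_normalize(rng.normal(size=(20, 8)))
query = l2_normalize(rng.normal(size=8))

refined = pseudo_relevance_feedback(query, gallery, k=3)
second_round = retrieve(refined, gallery, k=5)
```

The same `rocchio_refine` call could take other feedback vectors in place of the top-k image embeddings, for example text embeddings of synthetic or ground-truth captions, which is roughly how the generative and explicit strategies differ from PRF at this level of abstraction.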

Paper Structure

This paper contains 33 sections, 11 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Proposed VLM-based text-to-image retrieval with relevance feedback. Query representations are refined in feature space using relevance feedback vectors.
  • Figure 2: Attentive Feedback Summarizer. The user query, relevant images, and synthetic captions are processed using VLM encoders.
  • Figure 3: Retrieval with Rocchio: original vs ours. Retrieval metrics with relevance feedback using original and extended versions of Rocchio.
  • Figure 4: Multi-turn retrieval on Flickr30K. MRR@5 scores for multi-turn retrieval with relevance feedback. CLIP-B and CLIP-L refer to CLIP-ViT-B/32 and CLIP-ViT-L/14, respectively.
  • Figure 5: Cross-attention visualization. The example is sampled from Flickr30K dataset and processed with CLIP-ViT-B/32 as a retrieval backbone. The ground truth image corresponding to the query is highlighted with a green frame.
  • ...and 12 more figures