GenIR: Generative Visual Feedback for Mental Image Retrieval

Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis

TL;DR

Mental Image Retrieval (MIR) targets realistic, multi-round image search guided by a user’s internal mental image. GenIR introduces a generative visual feedback loop: at round $t$, a text-to-image generator $G$ turns the current query $q_t$ into a synthetic image $I_t^{\text{synthetic}}=G(q_t)$, which is then used for image-to-image retrieval via $I_t^{\text{retrieved}} = \arg\max_{I\in\mathcal{N}} \cos(\phi(I_t^{\text{synthetic}}), \phi(I))$, where $\phi$ is the image encoder and $\mathcal{N}$ is the candidate image pool. The synthetic image makes the system’s interpretation of the query explicit and guides subsequent refinements. The authors also present an automated pipeline to curate a multi-round MIR dataset and show that GenIR substantially improves Hits@10 on MSCOCO, FFHQ, Flickr30k, and Clothing-ADC, with robustness to the choice of diffusion generator. Overall, GenIR establishes a foundation for interpretable, visually grounded interactive multimodal retrieval and opens avenues for RL-based optimization and human-in-the-loop studies in MIR.
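The retrieval step in the formula above can be illustrated with a minimal sketch. It assumes embeddings have already been computed by some image encoder $\phi$; the function names `cosine` and `retrieve`, the toy pool, and the three-dimensional vectors are hypothetical stand-ins and not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_emb: Sequence[float],
             pool: Dict[str, Sequence[float]],
             k: int = 1) -> List[str]:
    """Return the ids of the k pool images most similar to the query embedding."""
    ranked = sorted(pool, key=lambda image_id: cosine(query_emb, pool[image_id]),
                    reverse=True)
    return ranked[:k]


# Toy candidate pool: image id -> embedding (hypothetical values).
pool = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}
# Stand-in for the embedding of the synthetic image G(q_t).
synthetic_emb = [0.9, 0.1, 0.0]
top = retrieve(synthetic_emb, pool, k=2)
```

With `k=1` this reduces to the $\arg\max$ in the formula; larger `k` gives the ranked list needed for a metric such as Hits@10. A real system would likely use an approximate nearest-neighbour index rather than a full sort over the pool.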

Abstract

Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
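The multi-round paradigm described in the abstract (generate an image from the query, retrieve with it, show it as feedback, refine the query) can be sketched as a simple loop. Everything here is an illustrative assumption: `genir_loop` and the `toy_*` functions are hypothetical, the "image" is just a string, the embedding is a unit-normalised keyword indicator vector, and the simulated user adds one missing detail per round. The paper's actual generator, encoder, and feedback procedure are not reproduced.

```python
from typing import Callable, Dict, List, Sequence, Tuple

VOCAB = ["giraffe", "tree", "sunset"]  # hypothetical details of the mental image


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def genir_loop(initial_query: str,
               generate: Callable[[str], str],
               embed: Callable[[str], List[float]],
               refine: Callable[[str, str], str],
               pool: Dict[str, List[float]],
               rounds: int = 3) -> List[Tuple[str, str]]:
    """Run `rounds` of generate -> retrieve -> refine; return (query, retrieved id) per round."""
    query = initial_query
    history = []
    for _ in range(rounds):
        synthetic = generate(query)           # reify the system's reading of the query
        emb = embed(synthetic)
        best = max(pool, key=lambda image_id: dot(emb, pool[image_id]))
        history.append((query, best))
        query = refine(query, synthetic)      # user compares the feedback with the mental image
    return history


def toy_generate(query: str) -> str:
    return query  # stand-in: the "image" is the query text itself


def toy_embed(image: str) -> List[float]:
    counts = [float(word in image) for word in VOCAB]
    norm = sum(c * c for c in counts) ** 0.5
    return [c / norm for c in counts] if norm else counts


def toy_refine(query: str, synthetic: str) -> str:
    # Simulated user: add the first detail that is missing from the synthetic image.
    missing = [word for word in VOCAB if word not in synthetic]
    return query + " " + missing[0] if missing else query


pool = {
    "target": toy_embed("giraffe tree sunset"),
    "distractor": toy_embed("giraffe"),
}
history = genir_loop("giraffe", toy_generate, toy_embed, toy_refine, pool, rounds=3)
```

In this toy run the vague first query retrieves the distractor, and the target is retrieved once the query has been refined with the missing details. Passing the generator, encoder, and refinement step as callables keeps the loop independent of any particular model.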

Paper Structure

This paper contains 61 sections, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of methods for the Mental Image Retrieval task. Top: our generative method, which reifies the intermediate query using an image generator model and applies image-to-image search for retrieval. Bottom: existing approaches (ChatIR and PlugIR), which support multi-round query improvements based on verbal feedback.
  • Figure 2: Visual progression of GenIR's image refinement process. Each row shows the evolution from the initial generation (leftmost), through multiple feedback iterations (middle columns), to the final generated result, alongside the target image (rightmost). Note how the generated images progressively capture more accurate details with each iteration: improved clothing and posture (row 1), facial features and giraffe positioning (row 2), and dining scene composition (row 3).
  • Figure 3: Performance comparison on the MSCOCO dataset (Hits@10, 50k search space). Left: our GenIR approach with the Infinity diffusion model (yellow) significantly outperforms all baselines, including Prediction Feedback (blue), Verbal Feedback with Gemma3-12b (red), and ChatIR (green). Right: comparison of different text-to-image diffusion models within our GenIR framework, showing consistent performance advantages across all generators, with Infinity and Lumina achieving the best results after 10 interaction rounds.
  • Figure 4: Performance comparison on the FFHQ, Flickr30k, and Clothing-ADC datasets (Hits@10). Our GenIR approach (yellow) consistently outperforms all baselines across domains, with particularly strong advantages on FFHQ and Clothing-ADC (the latter with a 1M+ image search space).
  • Figure 5: Analysis of vision-language model scale effects across feedback methods on MSCOCO (top) and FFHQ (bottom). While 12b models outperform 4b counterparts as expected, our GenIR with the smaller 4b model consistently surpasses alternative approaches even when using larger models.
  • ...and 3 more figures
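
The figures above report Hits@10. As a rough illustration, the sketch below computes Hits@K in its common form: the fraction of queries whose target appears among the top K retrieved items. This is an assumption about the metric's standard definition; the paper's exact protocol (for example, whether hits are accumulated across interaction rounds) is not stated in this summary, and the function name and toy data are hypothetical.

```python
from typing import List, Sequence


def hits_at_k(ranked_lists: Sequence[Sequence[str]],
              targets: Sequence[str],
              k: int = 10) -> float:
    """Fraction of queries whose target id is within the first k retrieved ids."""
    if len(ranked_lists) != len(targets):
        raise ValueError("need exactly one target per ranked list")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Toy example: three queries, each with a ranked list of retrieved image ids.
ranked_lists: List[List[str]] = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
]
targets = ["b", "f", "z"]  # "z" is never retrieved

score_at_2 = hits_at_k(ranked_lists, targets, k=2)  # only the first query is a hit
score_at_3 = hits_at_k(ranked_lists, targets, k=3)  # the first two queries are hits
```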