DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval
Zongwei Zhen, Biqing Zeng
TL;DR
DIR-TIR tackles interactive text-to-image retrieval through multi-turn dialogue and two complementary refinement modules. The Dialog Refiner Module generates targeted questions from candidate-image context and refines the description with LLM reasoning, while the Image Refiner Module optimizes prompts using visual discrepancy feedback; both feed into a hybrid ranking that selects a fixed-size candidate set of 10 images. Across VisDial, COCO, and Flickr30k, the approach achieves higher Hits@10 and Recall@10 than single-shot baselines (BLIP/CLIP) and chat-based methods, without fine-tuning the vision-language backbone. The results indicate improved controllability, fault tolerance, and scalability for interactive image search in large photo collections.
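The hybrid ranking step mentioned above can be illustrated with a minimal sketch. It assumes the two modules each produce one similarity score per gallery image and that the scores are fused by a weighted sum; the function name `hybrid_top_k`, the weight `alpha`, and the weighted-sum fusion itself are illustrative assumptions, not details confirmed by the paper. Only the candidate-set size of 10 comes from the summary.

```python
from typing import List, Sequence


def hybrid_top_k(
    text_scores: Sequence[float],
    image_scores: Sequence[float],
    alpha: float = 0.5,  # hypothetical fusion weight, not taken from the paper
    k: int = 10,         # candidate-set size stated in the summary
) -> List[int]:
    """Return the indices of the k gallery images with the highest fused score.

    text_scores:  similarity of each gallery image to the refined description
                  (assumed output of the dialog branch).
    image_scores: similarity of each gallery image to the refined visual prompt
                  (assumed output of the image branch).
    """
    if len(text_scores) != len(image_scores):
        raise ValueError("both score lists must cover the same gallery")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Weighted-sum fusion: alpha = 1 uses only the text branch,
    # alpha = 0 uses only the image branch.
    fused = [alpha * t + (1.0 - alpha) * v
             for t, v in zip(text_scores, image_scores)]

    # Sort gallery indices by fused score, best first, and keep the top k.
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:k]
```

Calling `hybrid_top_k(text_scores, image_scores)` with per-image scores from both branches returns at most 10 gallery indices, best first. Other fusion schemes, such as rank-based fusion, would fit the same interface.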
Abstract
This paper addresses the task of interactive, conversational text-to-image retrieval. Our DIR-TIR framework progressively refines the search for the target image through two specialized modules: the Dialog Refiner Module and the Image Refiner Module. The Dialog Refiner actively queries users to extract essential information and generates increasingly precise descriptions of the target image. Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions and strategically reduces this visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides better controllability and fault tolerance than conventional single-query methods and significantly improves target image hit accuracy. Comprehensive experiments across diverse image datasets demonstrate that our dialogue-based approach substantially outperforms initial-description-only baselines, while the integration of the two modules achieves both higher retrieval precision and a better interactive experience.
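To make "target image hit accuracy" over a multi-turn dialogue concrete, the following is a minimal sketch of a cumulative hit metric. It assumes that a dialogue counts as a hit at round r if the target image appeared in the top-k candidates at any round up to and including r; this cumulative convention and the names `cumulative_hits` and `hit_rate_per_round` are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Sequence


def cumulative_hits(
    rounds: Sequence[Sequence[int]],  # ranked image ids returned at each dialogue round
    target_id: int,
    k: int = 10,
) -> List[bool]:
    """For each round, report whether the target has been in the top-k so far."""
    hit_so_far = False
    hits = []
    for ranked in rounds:
        # Once the target has been retrieved, later rounds stay counted as hits
        # (assumed cumulative convention).
        hit_so_far = hit_so_far or target_id in ranked[:k]
        hits.append(hit_so_far)
    return hits


def hit_rate_per_round(all_hits: Sequence[Sequence[bool]]) -> List[float]:
    """Average the per-round hit flags over dialogues with equal round counts."""
    if not all_hits:
        return []
    n_rounds = len(all_hits[0])
    if any(len(h) != n_rounds for h in all_hits):
        raise ValueError("all dialogues must have the same number of rounds")
    return [sum(h[r] for h in all_hits) / len(all_hits) for r in range(n_rounds)]
```

Under this convention the hit rate never decreases as the dialogue proceeds, which makes it easy to compare a multi-turn system against a single-query baseline evaluated at round 0 only.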
