You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
TL;DR
This work questions the reliance on sketches alone for fine-grained image retrieval and instead queries with a sketch+text duet. It builds a CLIP-based framework that maps sketches to pseudo-word tokens and combines them with textual prompts through a novel compositionality mechanism, which includes a sketch–photo difference proxy $T = P - S$ (photo minus sketch) and a neutral-text regularizer. The approach integrates multiple losses (compositionality, text-prompt generalisation, region-aware matching, and reconstruction), achieves state-of-the-art results on FG-SBIR benchmarks, and enables downstream tasks such as sketch+text-based generation and domain transfer. The method reduces the need for large-scale fine-grained sketch–text datasets and is demonstrated on both object- and scene-level retrieval, with promising extensions to generated content and cross-domain applications.
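To make the difference proxy $T = P - S$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that $P$ and $S$ are photo and sketch embeddings in a shared space and that composition is plain vector addition; the paper's actual mechanism operates through CLIP and pseudo-word tokens and may differ. Random vectors stand in for CLIP features, and the helper names (`difference_proxy`, `compose`, `retrieve`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def difference_proxy(photo_emb, sketch_emb):
    """T = P - S: what the photo contains that the sketch does not."""
    return photo_emb - sketch_emb


def compose(sketch_emb, text_like_emb):
    """Assumed additive composition of a sketch with a text-like signal."""
    return l2_normalize(sketch_emb + text_like_emb)


def retrieve(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity."""
    sims = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)


# Toy data: random unit vectors stand in for CLIP photo embeddings (assumption).
rng = np.random.default_rng(0)
dim = 8
gallery = l2_normalize(rng.normal(size=(5, dim)))
target = 3

# A noisy "sketch" of the target photo, in the same embedding space.
sketch = l2_normalize(gallery[target] + 0.8 * rng.normal(size=dim))

# During training the paired photo is known, so T can be computed directly
# and used as a stand-in for the missing fine-grained text description.
T = difference_proxy(gallery[target], sketch)

# Composing the sketch with T recovers the photo embedding by construction.
query = compose(sketch, T)
ranking = retrieve(query, gallery)
```

Under these assumptions the composed query coincides with the target photo embedding, so the target ranks first. A learned model would not have the paired photo at test time; there, a real text embedding would play the role of $T$.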
Abstract
Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composed image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.
