You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song

TL;DR

This work questions the reliance on sketches alone for fine-grained image retrieval and instead lets a sketch and a text query work together. It builds a CLIP-based framework that maps sketches to pseudo-word tokens and combines them with textual prompts through a compositionality mechanism, which uses a sketch–photo difference proxy $T = P - S$ and a neutral-text regulariser. Training combines several losses (compositionality, text-prompt generalisation, region-aware matching, and reconstruction); the authors report state-of-the-art results on FG-SBIR benchmarks, and the model also supports downstream tasks such as sketch+text-based generation and domain transfer. The method reduces the need for large-scale fine-grained sketch–text datasets and is demonstrated on both object- and scene-level retrieval.
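
As a rough illustration of the difference proxy $T = P - S$ and of the token sequence $\{\mathbf{P}^L; \mathbf{s}^w; \mathbf{\Delta}^w\}$ named in the Figure 1 caption, the toy sketch below assembles such a sequence from random vectors. It is not the authors' implementation: the function names, the embedding width `d = 8`, and the linear map used in place of the visual-to-word converter are all assumptions made for illustration.

```python
import random


def difference_signal(photo_feat, sketch_feat):
    """Element-wise photo-minus-sketch difference, i.e. T = P - S."""
    return [p - s for p, s in zip(photo_feat, sketch_feat)]


def visual_to_word(feature, weight):
    """Hypothetical linear stand-in for the visual-to-word converter.

    `weight` is a list of rows; output[j] = sum_i feature[i] * weight[i][j].
    """
    return [sum(f * row[j] for f, row in zip(feature, weight))
            for j in range(len(weight[0]))]


def compose_tokens(prompt, sketch_feat, photo_feat, weight):
    """Return the token sequence {P^L; s^w; Delta^w} as a list of vectors."""
    delta = difference_signal(photo_feat, sketch_feat)
    s_w = visual_to_word(sketch_feat, weight)   # pseudo-word token for the sketch
    delta_w = visual_to_word(delta, weight)     # token for the difference signal
    return prompt + [s_w, delta_w]


random.seed(0)
d = 8  # toy embedding width (assumption)
prompt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]  # P^L, 3 x d
sketch_feat = [random.gauss(0, 1) for _ in range(d)]
photo_feat = [random.gauss(0, 1) for _ in range(d)]
weight = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

tokens = compose_tokens(prompt, sketch_feat, photo_feat, weight)
print(len(tokens), len(tokens[0]))  # 5 8
```

With a linear converter, the sketch token and the difference token sum to the converted photo feature, which is the intuition behind treating the difference as the missing "text" part of the query. The converter in the paper need not be linear, so this property holds only for the toy stand-in.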

Abstract

Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composed image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.

Paper Structure (13 sections, 6 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: A query sketch $(\mathcal{S})$ is passed through CLIP's visual encoder $(\mathbf{V})$ and then the visual-to-word converter $(\mathbf{C_{v2w}})$ to obtain a pseudo-word token embedding $(\mathbf{s}^w)$. This token is combined with a learnable continuous prompt $\mathbf{P}^L \in \mathbb{R}^{3\times d}$ and passed through the frozen $\mathbf{T_t}$ to produce the final sketch embedding $\mathbf{s}_L^T$. The compositionality constraint (middle) is part of our multitask training, not a separate stage as in two-stage approaches [baldrati2023zero, saito2023pic2word]: we compute $\mathbf{s}_L^{T,\Delta}$ by passing the sketch-photo difference signal $\mathbf{\Delta}$ through $\mathbf{C_{v2w}}$ and appending the result, so that $\mathbf{s}_L^{T,\Delta} = \{\mathbf{P}^L; \mathbf{s}^w; \mathbf{\Delta}^w\}$, on which $\mathcal{L}_{\text{comp}}$ is imposed. However, because this numeric signal $\mathbf{\Delta}^w$ does not lie in the input text manifold of CLIP [radford2021learning], it may disrupt the grammatical syntax of the input text. Thus, we mine a set of "neutral text" (via GPT [brown2020language]) to impose a regularisation loss $\mathcal{L}_{\text{reg}}$. Apart from $\mathcal{L}_{\text{trip}}$, we use $\mathcal{L}_{\text{RT}}$ (region-aware triplet) with $\mathbf{s}_L^T$ and photo embeddings $\mathbf{p^+}/\mathbf{p^-}$ to enforce fine-grained matching. Additionally, a reconstruction loss $\mathcal{L}_{\text{rec}}$ trains a UNet decoder $(\mathbf{G})$ for further cross-modal alignment. Furthermore, $\mathcal{L}_{\text{TT}}$, using a pre-defined set of standard language prompts, brings the learnable prompts closer to actual English prompts for generalisation to unseen sets. We train only the $\mathtt{LayerNorm}$ layers of $\mathbf{V}$, together with $\mathbf{C_{v2w}}$, $\mathbf{P}^L$, and $\mathbf{G}$. The testing pipeline is shown on the right. (Best viewed zoomed in.)
  • Figure 2: Top-5 fine-grained retrieval result comparison on ShoeV2/ChairV2. GT photos are green-bordered. (Zoom in for best view.)
  • Figure 3: Qualitative results for sketch+text composed fine-grained generation with pre-trained StyleGAN2 [karras2020analyzing] models.
  • Figure 4: Qualitative results for object sketch-based scene image retrieval on FS-COCO [chowdhury2022fs]. GT photos are green-bordered.
  • Figure 5: Top-3 domain attribute transfer result comparison on ImageNet-R [hendrycks2021many]. GT photos are green-bordered.
  • ...and 2 more figures
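
The Figure 1 caption refers to a triplet loss $\mathcal{L}_{\text{trip}}$ between the sketch embedding $\mathbf{s}_L^T$ and the photo embeddings $\mathbf{p^+}/\mathbf{p^-}$. The listing here does not give its exact form, so the following is only a generic margin-based triplet loss. The cosine distance, the margin value of 0.2, and the function names are assumptions, not details taken from the paper.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (distance to positive) - (distance to negative) + margin.

    The margin of 0.2 is an illustrative default, not a value from the paper.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# Toy embeddings: the sketch matches photo_pos exactly and is orthogonal to photo_neg.
sketch = [1.0, 0.0]
photo_pos = [1.0, 0.0]
photo_neg = [0.0, 1.0]

well_ordered = triplet_loss(sketch, photo_pos, photo_neg)  # positive is closer: no penalty
mis_ordered = triplet_loss(sketch, photo_neg, photo_pos)   # roles swapped: penalised
print(well_ordered, mis_ordered)
```

The loss is zero once the matching photo is closer to the sketch than the non-matching photo by at least the margin, and it grows as that ordering is violated.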