Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song

TL;DR

The paper addresses zero-shot sketch-based image retrieval by using frozen text-to-image diffusion models as feature extractors. It introduces a prompt-learning strategy (visual prompts on the input plus a learnable textual prompt), coupled with selective layer and time-step feature extraction, to produce discriminative cross-modal representations without fine-tuning the backbone. Across category-level and fine-grained SBIR benchmarks, the approach achieves state-of-the-art performance and demonstrates strong robustness through feature ensembling, while also extending to sketch+text-based retrieval. The work presents diffusion models as practical, high-capacity backbones for cross-modal retrieval and offers guidelines for prompt design and layer selection.

Abstract

This paper, for the first time, explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos. This proficiency is underpinned by their robust cross-modal capabilities and shape bias, findings that are substantiated through our pilot studies. In order to harness pre-trained diffusion models effectively, we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. For the former, we identify which layers are most enriched with information and are best suited for the specific retrieval requirements (category-level or fine-grained). Then we employ visual and textual prompts to guide the model's feature extraction process, enabling it to generate more discriminative and contextually relevant cross-modal representations. Extensive experiments on several benchmark datasets validate significant performance improvements.
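
To make the retrieval setting concrete, the following minimal sketch ranks gallery photos by their similarity to a query sketch's feature vector. The vectors are hand-written toy values standing in for features from a frozen diffusion backbone. Cosine similarity and the names `cosine` and `rank_gallery` are assumptions made for illustration, not details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(sketch_feat, gallery):
    """Return photo ids sorted from most to least similar to the sketch."""
    return sorted(
        gallery,
        key=lambda pid: cosine(sketch_feat, gallery[pid]),
        reverse=True,
    )

# Toy features (hypothetical values, not real model outputs).
sketch = [0.9, 0.1, 0.0]
gallery = {
    "cat_photo": [0.8, 0.2, 0.1],
    "car_photo": [0.0, 0.1, 0.9],
    "dog_photo": [0.5, 0.5, 0.2],
}
ranking = rank_gallery(sketch, gallery)
```

In a zero-shot evaluation the gallery categories would be unseen during training; the ranking step itself stays the same.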

Paper Structure (12 sections, 3 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Feature extraction via text-to-image diffusion model.
  • Figure 2: Texture-altered images from the cue-conflict dataset (Geirhos et al., 2018).
  • Figure 3: Given the frozen SD (Rombach et al., 2022) backbone feature extractor, our method learns a single textual prompt, and sketch/photo-specific visual prompts via triplet loss.
  • Figure 4: Plots showing low-data scenario performance for ZS-SBIR (left) and ZS-FG-SBIR (right) setup on the Sketchy dataset (Sangkloy et al., 2016).
  • Figure 5: PCA representation of SD (Rombach et al., 2022) internal features from $\mathcal{F}_\mathbf{u}^1$ upsampling layers of UNet for different time-steps ($t \in \{0, 100, \ldots, 900\}$). Different regions of sketch and photo feature maps from $t\in[200,300]$ (highlighted in red) portray strong semantic feature correspondence (represented by the same colours in the PCA map), while the features from the later time-steps are non-aligned.
  • ...and 2 more figures
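
The triplet loss named in the Figure 3 caption can be sketched as below. This is the standard margin-based formulation with Euclidean distance. The margin value, the choice of distance, and the function names are assumptions, since this summary does not specify them. In the paper's setup, the quantities updated by such a loss would be the learnable prompts, while the diffusion backbone stays frozen.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on margin + d(anchor, positive) - d(anchor, negative)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, margin + gap)

# Toy 2-D features (hypothetical values, not real model outputs).
sketch_feat = [1.0, 0.0]
matching_photo = [1.0, 0.0]   # distance 0 from the sketch
other_photo = [0.0, 0.0]      # distance 1 from the sketch

# Matching photo is closer by more than the margin: no penalty.
satisfied = triplet_loss(sketch_feat, matching_photo, other_photo)

# Roles swapped: the "positive" is farther than the "negative", so the loss is positive.
violated = triplet_loss(sketch_feat, other_photo, matching_photo)
```

The loss is zero once the matching photo is closer to the sketch than the non-matching photo by at least the margin, so only triplets that violate this ordering contribute a gradient.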