Dual-Modal Prompting for Sketch-Based Image Retrieval

Liying Gao, Bingliang Jiao, Peng Wang, Shizhou Zhang, Hanwang Zhang, Yanning Zhang

TL;DR

This paper addresses zero-shot, fine-grained sketch-based image retrieval (SBIR) by proposing DP-CLIP, a dual-modal prompting framework built on CLIP. DP-CLIP derives category-specific prompts from a small set of target-category images and the textual category label. A Visual Prompting Module and a Textual Prompting Module inject category-centric insights, and a Patch-Level Matching Module captures local correspondences between sketches and photos. The method improves Acc.@1 by 7.3 percentage points over the state-of-the-art fine-grained zero-shot SBIR (FG-ZS-SBIR) method on Sketchy and is competitive on category-level ZS-SBIR benchmarks, with only a modest increase in computation. Overall, DP-CLIP adapts to unseen categories in a category-aware way and supports finer-grained retrieval without full-model fine-tuning, which makes SBIR more practical to deploy.

Abstract

Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.

Paper Structure (19 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Comparison between (a) existing methods and (b) our method on the fine-grained zero-shot SBIR task. Most existing models accumulate knowledge from a limited set of seen categories and transfer it directly to unseen ones. We believe this is sub-optimal because not all knowledge acquired from seen categories is transferable; some of it may be invalid or even detrimental. For instance, the presence of a propeller (marked by red bounding boxes) may be a distinctive feature for identifying airplanes but not helicopters. To address this, we introduce a dual-modal prompting strategy that uses the textual category label and a set of images from the target category to equip the model with category-centric insights, prompting it to adapt to the target category and thereby retrieve more accurately.
  • Figure 2: The architecture of our DP-CLIP model. We use pre-trained CLIP as the backbone model and freeze all parameters, excluding those inside normalization layers. In our DP-CLIP, the visual prompting module is responsible for generating category-specific visual prompts with a set of images from the target category. The textual prompting (T-Prompt) module utilizes the textual category label to produce category-specific channel scaling vectors, guiding our model to adapt to the target category.
  • Figure 3: The architecture of (a) Original ViT layer, (b) Direct Scaling, and (c) Side-Way Scaling.
  • Figure 4: Visualization results of our visual prompting module. We analyze and display the information within the category-specific visual prompts by computing the cosine similarity between the visual prompts and image tokens. Regions of high similarity are highlighted in red, while regions of low similarity are indicated in blue. In this part, we use support images from the "cabin" and "tree" categories for guidance. The results show that visual prompts generated from support images of different categories carry different category insights. Given the same input images, these prompts can encourage our model to focus on target objects of different categories, helping it adapt to the target category and improve retrieval.
  • Figure 5: The architecture of our patch-level matching module. As shown in the left part, we divide the features extracted from the penultimate ViT layer inside CLIP, which have a spatial size of $7\times7$, into four overlapping $5\times5$ patches anchored at the four corners: top-left, top-right, bottom-left, and bottom-right. The feature tokens within each patch are concatenated with a copy of the [CLS] token feature and then processed through a local ViT layer. The output feature of the [CLS] token in each local branch is sent into a linear projection layer, whose output is treated as the final local feature for that patch.
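
The patch split described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `split_corner_patches` and `prepend_cls`, the toy feature dimension, and the row-major ordering of the 49 tokens are all assumptions made for illustration. The local ViT layer and the linear projection are omitted because the caption does not specify them.

```python
import numpy as np

def split_corner_patches(tokens, grid=7, patch=5):
    """Split (grid*grid, dim) tokens into four corner patches of patch*patch tokens.

    Assumes the tokens are stored in row-major order over the spatial grid.
    """
    dim = tokens.shape[-1]
    fmap = tokens.reshape(grid, grid, dim)
    offset = grid - patch  # 2 for a 7x7 grid and 5x5 patches
    corners = {
        "top_left": (0, 0),
        "top_right": (0, offset),
        "bottom_left": (offset, 0),
        "bottom_right": (offset, offset),
    }
    patches = {}
    for name, (row, col) in corners.items():
        window = fmap[row:row + patch, col:col + patch]
        patches[name] = window.reshape(patch * patch, dim)
    return patches

def prepend_cls(patch_tokens, cls_token):
    """Concatenate a copy of the [CLS] feature in front of one patch's tokens."""
    return np.concatenate([cls_token[None, :], patch_tokens], axis=0)

# Toy inputs: 49 spatial tokens plus one [CLS] token (dim=8 is illustrative only).
rng = np.random.default_rng(0)
dim = 8
tokens = rng.standard_normal((49, dim))
cls_token = rng.standard_normal(dim)

patches = split_corner_patches(tokens)
# Each local branch would receive 1 + 25 = 26 tokens.
local_inputs = {name: prepend_cls(p, cls_token) for name, p in patches.items()}
```

Because each $5\times5$ window is larger than half of the $7\times7$ grid, the four patches overlap in the centre, so central regions contribute to every local branch.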