Dynamic Multi-level Weighted Alignment Network for Zero-shot Sketch-based Image Retrieval

Hanwen Su, Ge Song, Jiyan Wang, Yuanbo Zhu

TL;DR

This work tackles zero-shot sketch-based image retrieval by addressing modality and domain gaps with a Dynamic Multi-level Weighted Alignment Network (DM-WAN). The method combines uni-modal feature extraction (ViT for sketches/images and CLIP-based text encoding), cross-modal weighting across local and global levels, and a weighted quadruplet loss that balances modality contributions while emphasizing high-quality alignments. Key contributions include a novel dynamic pairing weight mechanism and a domain-balanced loss that improves cross-modal generalization to unseen categories, validated on Sketchy, TU-Berlin, and QuickDraw with state-of-the-art results. The approach promises practical impact in zero-shot retrieval scenarios by reducing training-data quality issues and enhancing robustness to diverse sketch and image appearances.

Abstract

The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has attracted increasing attention due to its wide applications, e.g., e-commerce. Despite progress in this field, previous works suffer from imbalanced samples across modalities and inconsistent, low-quality information during training, resulting in sub-optimal performance. Therefore, in this paper, we introduce an approach called Dynamic Multi-level Weighted Alignment Network for ZS-SBIR. It consists of three components: (i) a Uni-modal Feature Extraction Module that includes a CLIP text encoder and a ViT for extracting textual and visual tokens, (ii) a Cross-modal Multi-level Weighting Module that produces an alignment weight list from the local and global aggregation blocks to measure the alignment quality of sketch and image samples, and (iii) a Weighted Quadruplet Loss Module that aims to improve the balance of domains in the triplet loss. Experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, show that our method delivers superior performance over state-of-the-art ZS-SBIR methods.

Paper Structure

This paper contains 16 sections, 16 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our proposed method. (i) Uni-modal Feature Extraction feeds the text, sketch, and image into their corresponding backbones: the text encoder of CLIP-16 for text, and a ViT for extracting sketch and image features. (ii) Cross-modal Multi-level Weighting takes the tokens after cross-attention as its input and calculates the weight of each training sample in a batch. (iii) Finally, a weighted quadruplet loss is used for metric learning; each quadruplet includes an anchor sketch, a negative sketch, a positive image, and a negative image.
  • Figure 2: Example retrieval comparison showing the given query sketches and the top 6 retrieved images. Red boxes denote false positives; green boxes denote true positives.
  • Figure 3: Multi-level weights reflecting the similarity of sketch-image exemplar pairs during training. Values closer to 1 indicate higher similarity.
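
The quadruplet described in the Figure 1 caption (anchor sketch, positive image, negative image, negative sketch) can be illustrated with a minimal sketch. This is not the paper's loss: the exact formulation is not given in this summary, so the hinge form, the Euclidean distance, the `margin` value, and the function name `weighted_quadruplet_loss` are all assumptions made for illustration. The per-sample `weights` stand in for the alignment weights produced by the weighting module.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_quadruplet_loss(anchor_sk, pos_img, neg_img, neg_sk, weights, margin=0.2):
    """Hypothetical weighted quadruplet loss (illustrative assumption only).

    Each quadruplet contributes two hinge terms: one pushes the negative
    image away from the anchor sketch, the other pushes the negative sketch
    away. Both are scaled by that sample's alignment weight, so a pair with
    a weight near 0 contributes little to the batch loss.
    """
    total = 0.0
    for a, p, ni, ns, w in zip(anchor_sk, pos_img, neg_img, neg_sk, weights):
        d_pos = euclidean(a, p)
        # Cross-modal term: anchor sketch vs. negative image.
        img_term = max(0.0, d_pos - euclidean(a, ni) + margin)
        # Same-modality term: anchor sketch vs. negative sketch.
        sk_term = max(0.0, d_pos - euclidean(a, ns) + margin)
        total += w * (img_term + sk_term)
    return total / len(weights)


# Usage: one quadruplet of 2-D embeddings with alignment weight 0.5.
loss = weighted_quadruplet_loss(
    anchor_sk=[[0.0, 0.0]],
    pos_img=[[0.0, 0.0]],
    neg_img=[[3.0, 4.0]],
    neg_sk=[[0.0, 0.1]],
    weights=[0.5],
)
print(loss)
```

In this toy input the negative image is already far beyond the margin, so only the nearby negative sketch produces a penalty, and that penalty is halved by the weight of 0.5.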