Relation-Aware Meta-Learning for Zero-shot Sketch-Based Image Retrieval

Yang Liu, Jiale Du, Xinbo Gao, Jungong Han

TL;DR

This paper tackles zero-shot sketch-based image retrieval by introducing RAMLN, a memory-augmented meta-learning framework that learns an adaptive margin for a relation-aware quadruplet loss. The loss combines inter-modal and intra-modal constraints with two negatives from different modalities to better separate classes and align sketches with photos, while a meta-learned margin $\mathcal{R}(x)$ stored in external memory enables strong generalization to unseen categories. An auxiliary cross-entropy objective stabilizes training, and experiments on Sketchy Extended and TU-Berlin Extended show clear improvements over state-of-the-art methods, validating both the loss design and the margin adaptation mechanism. Overall, RAMLN advances cross-modal metric learning for ZS-SBIR by enabling dynamic margin adaptation and leveraging memory to capture rare but discriminative features across seen classes, improving generalization to unseen classes in practice.

Abstract

Sketch-based image retrieval (SBIR) relies on free-hand sketches to retrieve natural photos within the same class. However, its practical application is limited by its inability to retrieve classes absent from the training set. To address this limitation, the task has evolved into Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR), where model performance is evaluated on unseen categories. Traditional SBIR primarily focuses on narrowing the domain gap between photo and sketch modalities. However, in the zero-shot setting, the model not only needs to address this cross-modal discrepancy but also requires a strong generalization capability to transfer knowledge to unseen categories. To this end, we propose a novel framework for ZS-SBIR that employs a pair-based relation-aware quadruplet loss to bridge feature gaps. By incorporating two negative samples from different modalities, the approach prevents positive features from becoming disproportionately distant from one modality while remaining close to another, thus enhancing inter-class separability. We also propose a Relation-Aware Meta-Learning Network (RAMLN) to obtain the margin, a hyper-parameter of cross-modal quadruplet loss, to improve the generalization ability of the model. RAMLN leverages external memory to store feature information, which it utilizes to assign optimal margin values. Experimental results obtained on the extended Sketchy and TU-Berlin datasets show a sharp improvement over existing state-of-the-art methods in ZS-SBIR.
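
As a rough illustration of using two negatives from different modalities, the sketch below computes a hinge-style quadruplet loss for a single sketch anchor. It is not the paper's exact formulation: the Euclidean distance, the sum of two hinge terms, the function and argument names, and the fixed `margin` are all assumptions made here for illustration. In RAMLN the margin is produced by the meta-learning network rather than set by hand.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quadruplet_loss(
    anchor_sketch: Vector,
    positive_photo: Vector,
    negative_photo: Vector,
    negative_sketch: Vector,
    margin: float,
) -> float:
    """Hypothetical quadruplet loss with one negative per modality.

    Assumed form (not taken from the paper):
        max(0, d(a, p) - d(a, n_photo) + m) + max(0, d(a, p) - d(a, n_sketch) + m)
    """
    d_pos = euclidean(anchor_sketch, positive_photo)
    # Inter-modal term: the positive photo should be closer than a photo of another class.
    inter = max(0.0, d_pos - euclidean(anchor_sketch, negative_photo) + margin)
    # Intra-modal term: the positive photo should be closer than a sketch of another class.
    intra = max(0.0, d_pos - euclidean(anchor_sketch, negative_sketch) + margin)
    return inter + intra


loss = quadruplet_loss(
    anchor_sketch=[0.0, 0.0],
    positive_photo=[0.0, 1.0],
    negative_photo=[3.0, 4.0],
    negative_sketch=[0.0, 1.5],
    margin=1.0,
)
```

In this toy call the photo negative is already far enough away to contribute nothing, while the sketch negative sits inside the margin and is penalized. A larger margin would penalize both, which is why the choice of margin matters for how tightly classes are separated.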

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) and (b) respectively illustrate the performance of the traditional triplet loss method and our proposed approach in handling the ZS-SBIR task. The traditional triplet loss method tends to misclassify objects with similar shapes but belonging to different categories. For example, due to their similar shapes, it might mistakenly classify a parachute as a banana. In response, we propose a novel relation-aware quadruplet loss function to thoroughly explore both cross-modal and intra-modal relationships. Additionally, we employ a meta-learning strategy to optimize the margin in quadruplet loss, adaptively determining the optimal margin value. This approach not only enhances the model's generalization ability but also significantly improves its capacity to distinguish between objects with similar shapes, such as accurately differentiating a parachute from a banana.
  • Figure 2: The overall structure of the proposed method. The image encoder extracts features from both sketches and photos in the embedding space. The training architecture then combines two parts: (i) Bidirectional training, which incorporates relation-aware quadruplets from different modalities to learn feature distributions and optimizes them using the margin obtained through meta-optimization. (ii) Classification training, which uses cross-entropy loss with the Softmax function and helps to avoid getting stuck in local optima. Although retrieval is based only on distance, classification training is still useful.
  • Figure 3: t-SNE visualization of sketch and photo embeddings on the TU-Berlin Extended dataset. We randomly sample from 7 test categories for visualization. Different colors refer to different categories. Crosses denote photos and hexagons denote sketches. (a) is our proposed method and (b) is Bid-Tri. Our method produces tighter clusters.
  • Figure 4: Top-7 image retrieval examples on TU-Berlin and Sketchy. All of the examples come from unseen classes. We use ticks and crosses to indicate correct and incorrect retrievals.
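
Since retrieval is based only on distance in the embedding space (Figure 2) and results are shown as top-7 lists (Figure 4), the retrieval step can be pictured as a nearest-neighbour ranking. The following is a minimal sketch under assumptions: embeddings are plain lists of floats, the distance is Euclidean, and `retrieve_top_k` is a hypothetical helper name, not something defined in the paper.

```python
import math
from typing import List, Sequence


def retrieve_top_k(
    query_sketch: Sequence[float],
    gallery_photos: Sequence[Sequence[float]],
    k: int = 7,
) -> List[int]:
    """Return indices of the k gallery photos closest to the query sketch.

    Assumes Euclidean distance between embeddings (an illustrative choice).
    """
    ranked = sorted(
        range(len(gallery_photos)),
        key=lambda i: math.dist(query_sketch, gallery_photos[i]),
    )
    return ranked[:k]


gallery = [[5.0, 5.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.5]]
top2 = retrieve_top_k([0.0, 0.0], gallery, k=2)
```

Here `top2` holds the indices of the two photo embeddings nearest to the query, ordered from closest to farthest.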