Modality-Aware Representation Learning for Zero-shot Sketch-based Image Retrieval

Eunyi Lyou, Doyeon Lee, Jooeun Kim, Joonseok Lee

TL;DR

The paper addresses zero-shot sketch-based image retrieval by introducing MA-SBIR, a modality-aware framework that disentangles semantic content from modality-specific signals and performs indirect, text-guided alignment via CLIP without requiring paired sketch-photo data. By explicitly encoding modality and semantic components and training with a combination of cross-modal, semantic, and reconstruction losses, the method achieves state-of-the-art results in category-level ZS-SBIR and robust performance in generalized ZS-SBIR. It also extends to fine-grained SBIR with targeted adaptations, though instance-level alignment remains challenging without direct paired supervision. The approach leverages unpaired multimodal data to bridge the modality gap, offering practical benefits for scalable cross-modal retrieval with real-world data constraints.

Abstract

Zero-shot learning offers an efficient way for a machine learning model to handle unseen categories, avoiding exhaustive data collection. Zero-shot Sketch-based Image Retrieval (ZS-SBIR) simulates real-world scenarios where it is hard and costly to collect paired sketch-photo samples. We propose a novel framework that indirectly aligns sketches and photos by contrasting them through texts, removing the need for access to sketch-photo pairs. With an explicit modality encoding learned from data, our approach disentangles modality-agnostic semantics from modality-specific information, bridging the modality gap and enabling effective cross-modal content retrieval within a joint latent space. Through comprehensive experiments, we verify the efficacy of the proposed model on ZS-SBIR, and show that it can also be applied to generalized and fine-grained settings.
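
The indirect alignment described in the abstract can be illustrated with a minimal sketch: sketches and photos are each contrasted against class-text embeddings, and never against each other. The code below is an assumption-laden toy, not the authors' implementation. The function name `contrastive_loss`, the one-directional image-to-text form of the loss, the temperature of 0.07, and the 2-D embeddings are all hypothetical choices made for illustration.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(images: List[Vector], texts: List[Vector],
                     temperature: float = 0.07) -> float:
    """Mean cross-entropy of matching image i to text i among all texts.

    One-directional (image -> text) InfoNCE-style loss on cosine similarity.
    The temperature value is an assumption, not taken from the paper.
    """
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    total = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, t) / temperature for t in texts]
        # Numerically stable log-sum-exp over all candidate texts.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(images)


# Toy embeddings: two classes, one sketch and one photo per class.
texts = [[1.0, 0.0], [0.0, 1.0]]
sketches = [[0.9, 0.1], [0.2, 0.8]]
photos = [[0.8, 0.3], [0.1, 0.95]]

# Sketches and photos are each pulled toward their class text; no
# sketch-photo pair is ever compared directly.
aligned_loss = contrastive_loss(sketches, texts) + contrastive_loss(photos, texts)

# Swapping the class texts breaks the correspondence and raises the loss.
swapped = [texts[1], texts[0]]
misaligned_loss = contrastive_loss(sketches, swapped) + contrastive_loss(photos, swapped)
```

Because both image modalities are attracted to the same text anchors, a sketch and a photo of the same class end up close to each other as a side effect, which is the sense in which the alignment is indirect.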

Paper Structure

This paper contains 13 sections, 14 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of our Architecture. Given an image $\mathbf{x}$ (either a sketch or a photo) and a text $\mathbf{y}$, modality-specific encoders ($\mathbf{E}_{\text{img}}$ and $\mathbf{E}_{\text{txt}}$) embed them to $\mathbf{z}_{\text{img}}$ and $\mathbf{z}_{\text{txt}}$, respectively, where each $\mathbf{z}$ is a sum of semantic embedding $\mathbf{s}$ and modality encoding $\mathbf{m}$. We acquire semantic-only vectors ($\mathbf{s}_{\text{img}}$, $\mathbf{s}_{\text{txt}}$) from them by subtracting the modality encoding ($\mathbf{m}_{\text{img}}$, $\mathbf{m}_{\text{txt}}$). By adding the opposite modality encoding to the semantic embeddings, we reconstruct the image and text embeddings ($\mathbf{z}'_{\text{img}}$ and $\mathbf{z}'_{\text{txt}}$).
  • Figure 2: Overview of the FG-SBIR model architecture. Unlike categorical SBIR, the fine-grained version takes paired sketches and photos as input to obtain latent vectors ($\mathbf{z}_{\{\text{ph},\text{sk}\}}$), followed by steps 2 and 3 described in Figure 1.
  • Figure 3: t-SNE visualization on Sketchy-Ext (Liu et al., 2017).
  • Figure 4: Categorical ZS-SBIR. Top-5 retrieved images on QuickDraw-Ext (Dey et al., 2019). Correct samples are circled.
  • Figure 5: FG-ZS-SBIR. Top-5 retrieved images on Sketchy (Dey et al., 2019). Correct samples are circled.
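
The embedding arithmetic in the Figure 1 caption (each $\mathbf{z}$ is a sum $\mathbf{s} + \mathbf{m}$, the semantic vector is recovered by subtraction, and the other modality's encoding is added back for reconstruction) can be sketched as follows. This is one reading of the caption under stated assumptions, not the authors' code: the helper names `strip_modality` and `swap_modality`, the fixed toy vectors, and the choice to reconstruct the text embedding from the image's semantic vector are all hypothetical. In the paper the modality encodings are learned from data rather than hard-coded.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def strip_modality(z: Vector, m_own: Vector) -> Vector:
    """Recover the semantic-only vector: s = z - m."""
    return sub(z, m_own)


def swap_modality(z: Vector, m_own: Vector, m_other: Vector) -> Vector:
    """Remove this modality's encoding and add the other one: z' = (z - m_own) + m_other."""
    return add(strip_modality(z, m_own), m_other)


# Toy modality encodings (hypothetical values; learned in the actual model).
m_img = [0.5, -0.25, 0.0]
m_txt = [-0.5, 0.25, 1.0]

# A shared semantic vector for one concept, e.g. the class "cat".
s = [1.0, 2.0, 3.0]

# Encoder outputs under the additive assumption z = s + m.
z_img = add(s, m_img)
z_txt = add(s, m_txt)

# Step 1: subtract the modality encoding to get semantic-only vectors.
s_img = strip_modality(z_img, m_img)
s_txt = strip_modality(z_txt, m_txt)

# Step 2: add the opposite modality encoding to reconstruct the other embedding.
z_txt_rec = swap_modality(z_img, m_img, m_txt)
z_img_rec = swap_modality(z_txt, m_txt, m_img)
```

In this idealized toy the reconstructions match the original embeddings exactly because both modalities share the same semantic vector. With real encoders the match is only approximate, which is presumably what the reconstruction loss mentioned in the TL;DR penalizes.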