Modality-Aware Representation Learning for Zero-shot Sketch-based Image Retrieval
Eunyi Lyou, Doyeon Lee, Jooeun Kim, Joonseok Lee
TL;DR
The paper addresses zero-shot sketch-based image retrieval by introducing MA-SBIR, a modality-aware framework that disentangles semantic content from modality-specific signals and performs indirect, text-guided alignment via CLIP without requiring paired sketch-photo data. By explicitly encoding modality and semantic components and training with a combination of cross-modal, semantic, and reconstruction losses, the method achieves state-of-the-art results in category-level ZS-SBIR and robust performance in generalized ZS-SBIR. It also extends to fine-grained SBIR with targeted adaptations, though instance-level alignment remains challenging without direct paired supervision. The approach leverages unpaired multimodal data to bridge the modality gap, offering practical benefits for scalable cross-modal retrieval with real-world data constraints.
Abstract
Zero-shot learning offers an efficient way for a machine learning model to handle unseen categories, avoiding exhaustive data collection. Zero-shot Sketch-based Image Retrieval (ZS-SBIR) simulates real-world scenarios where it is hard and costly to collect paired sketch-photo samples. We propose a novel framework that indirectly aligns sketches and photos by contrasting them through texts, removing the need for access to sketch-photo pairs. With an explicit modality encoding learned from data, our approach disentangles modality-agnostic semantics from modality-specific information, bridging the modality gap and enabling effective cross-modal content retrieval within a joint latent space. Through comprehensive experiments, we verify the efficacy of the proposed model on ZS-SBIR and show that it can also be applied to generalized and fine-grained settings.
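To make the idea of indirect, text-guided alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that embeddings for sketches, photos, and their text descriptions have already been computed (for example by CLIP encoders), and it uses a standard symmetric InfoNCE loss as a stand-in for the cross-modal objective. The function names, the temperature value, and the use of in-batch negatives are all assumptions made for illustration; the semantic and reconstruction losses and the modality encoding are omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair;
    # every other row in the batch acts as a negative (a simplification).
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def text_guided_loss(sketch_emb, sketch_text_emb, photo_emb, photo_text_emb):
    # Hypothetical helper: sketches and photos are each contrasted with text
    # only, never with each other, so no sketch-photo pairs are required.
    return info_nce(sketch_emb, sketch_text_emb) + info_nce(photo_emb, photo_text_emb)
```

Because both modalities are pulled toward the same text embedding space, sketches and photos of the same category end up close to each other without ever being compared directly. Note that the sketch batch and the photo batch can come from entirely unrelated samples.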
