A Sketch+Text Composed Image Retrieval Dataset for Thangka
Jinyu Xu, Yi Sun, Jiangling Zhang, Qing Xie, Daomin Ji, Zhifeng Bao, Jiachen Li, Yanchun Ma, Yongjian Liu
TL;DR
CIRThan is a knowledge-specific sketch+text composed image retrieval (CIR) benchmark for Thangka imagery that addresses the limitations of general-domain CIR datasets in handling fine-grained structural and symbolic semantics. It pairs each of 2,287 Thangka images with a human-drawn sketch and textual descriptions at three hierarchical semantic levels, enabling queries of varying semantic granularity. Experiments show a substantial gap between supervised and zero-shot CIR in this domain, with richer textual descriptions consistently improving performance, and reveal that current multimodal language models struggle to ground sketch-based and domain-specific semantics without in-domain supervision. The dataset and findings motivate advances in domain-aware retrieval, hierarchical semantic modeling, and user-centric multimodal query formulation for cultural heritage data and other knowledge-specific visual domains.
Abstract
Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
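To make the query structure described above concrete, the following is a minimal sketch of how one sample and a composed query might be represented. It is an illustration only: the class names, field names, file paths, and the 1-to-3 level indexing are assumptions made here and are not taken from the released dataset, whose actual file layout should be checked in the repository.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThangkaSample:
    """Hypothetical record: one target image, its sketch, and three descriptions."""
    image_path: str
    sketch_path: str
    # One description per semantic level; the ordering is an assumption.
    descriptions: Tuple[str, str, str]


@dataclass(frozen=True)
class ComposedQuery:
    """A sketch+text query: structural intent plus a textual specification."""
    sketch_path: str
    text: str


def compose_query(sample: ThangkaSample, level: int) -> ComposedQuery:
    """Pair the sample's sketch with the description at the requested level (1 to 3)."""
    if not 1 <= level <= 3:
        raise ValueError("level must be 1, 2, or 3")
    return ComposedQuery(sample.sketch_path, sample.descriptions[level - 1])


# Placeholder values only; these are not real dataset entries.
sample = ThangkaSample(
    image_path="images/example.jpg",
    sketch_path="sketches/example.png",
    descriptions=("level-1 text", "level-2 text", "level-3 text"),
)
query = compose_query(sample, level=2)
```

Selecting a different `level` changes only the text half of the query, which mirrors the idea of issuing the same sketch with descriptions of varying semantic granularity.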
