A Sketch+Text Composed Image Retrieval Dataset for Thangka

Jinyu Xu; Yi Sun; Jiangling Zhang; Qing Xie; Daomin Ji; Zhifeng Bao; Jiachen Li; Yanchun Ma; Yongjian Liu

TL;DR

CIRThan is a sketch+text composed image retrieval (CIR) benchmark for Thangka imagery, a knowledge-specific domain whose fine-grained structural and symbolic semantics are poorly handled by general-domain CIR datasets. It pairs each of 2,287 Thangka images with a human-drawn sketch and textual descriptions at three hierarchical levels, so queries can vary in semantic granularity. Experiments show a substantial gap between supervised and zero-shot CIR in this domain, and richer textual descriptions consistently improve performance. They also indicate that current multimodal language models struggle to ground sketches and domain-specific semantics without in-domain supervision. The dataset and findings motivate further work on domain-aware retrieval, hierarchical semantic modeling, and user-centric multimodal query formulation for cultural heritage data and other knowledge-specific visual domains.

Abstract

Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
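
As a rough illustration of the pairing described above, the following Python sketch shows one way a sample and a composed query could be represented. It is not the dataset's actual schema: the class names, field names, file paths, and the coarse-to-fine ordering of the three descriptions are all assumptions made for this example, and the released repository may organize its files differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThangkaSample:
    """One hypothetical record: a target image, its sketch, and three descriptions."""
    image_path: str   # target Thangka image (hypothetical field name)
    sketch_path: str  # human-drawn sketch of the same image (hypothetical field name)
    # Three textual descriptions; the coarse-to-fine ordering is an assumption.
    descriptions: Tuple[str, str, str]


@dataclass(frozen=True)
class ComposedQuery:
    """A sketch+text query: the sketch carries structure, the text carries semantics."""
    sketch_path: str
    text: str


def compose_query(sample: ThangkaSample, level: int) -> ComposedQuery:
    """Build a composed query from a sample's sketch and one description level (0, 1, or 2)."""
    if not 0 <= level < len(sample.descriptions):
        raise ValueError(f"level must be in [0, {len(sample.descriptions) - 1}], got {level}")
    return ComposedQuery(sketch_path=sample.sketch_path, text=sample.descriptions[level])
```

Selecting a different `level` changes only the text half of the query, which mirrors the idea of issuing the same sketch with descriptions of varying semantic granularity.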
