SkinCaRe: A Multimodal Dermatology Dataset Annotated with Medical Caption and Chain-of-Thought Reasoning

Yuhao Shen, Liyuan Sun, Yan Xu, Wenbin Liu, Shuping Zhang, Shawn Afvari, Zhongyi Han, Jiaoyan Song, Yongzhi Ji, Tao Lu, Xiaonan He, Xin Gao, Juexiao Zhou

TL;DR

SkinCaRe addresses the need for interpretable dermatology AI by uniting SkinCAP, which provides dermatologist-authored observation-first captions for 4,000 images, with SkinCoT, which offers clinician-verified hierarchical chain-of-thought narratives for 3,041 images. The two branches share a common schema, ethics, and provenance, enabling joint supervision for description and reasoning in a single resource totaling 7,041 cases. Technical validation shows high-quality, bilingual captions and robust CoT reasoning (grand mean 4.9433 across six clinician-rated metrics, with 79% perfect scores). Public availability at HuggingFace supports training and evaluating multimodal models that describe skin findings and explain diagnostic paths, advancing trustworthy, clinically grounded dermatology AI.

Abstract

With the widespread application of artificial intelligence (AI), particularly deep learning (DL) and vision large language models (VLLMs), in skin disease diagnosis, the need for interpretability becomes crucial. However, existing dermatology datasets are limited in their inclusion of concept-level meta-labels, and none offer rich medical descriptions in natural language. This deficiency impedes the advancement of LLM-based methods in dermatologic diagnosis. To address this gap and provide a meticulously annotated dermatology dataset with comprehensive natural language descriptions, we introduce SkinCaRe, a comprehensive multimodal resource that unifies SkinCAP and SkinCoT. SkinCAP comprises 4,000 images sourced from the Fitzpatrick 17k skin disease dataset and the Diverse Dermatology Images dataset, annotated by board-certified dermatologists to provide extensive medical descriptions and captions. In addition, we introduce SkinCoT, a curated dataset pairing 3,041 dermatologic images with clinician-verified, hierarchical chain-of-thought (CoT) diagnoses. Each diagnostic narrative is rigorously evaluated against six quality criteria and iteratively refined until it meets a predefined standard of clinical accuracy and explanatory depth. Together, SkinCAP (captioning) and SkinCoT (reasoning), collectively referred to as SkinCaRe, encompass 7,041 expertly curated dermatologic cases and provide a unified and trustworthy resource for training multimodal models that both describe and explain dermatologic images. SkinCaRe is publicly available at https://huggingface.co/datasets/yuhos16/SkinCaRe.

Paper Structure (20 sections, 3 figures, 5 tables)

This paper contains 20 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the SkinCaRe curation workflow. SkinCaRe comprises two complementary branches curated under unified ethics, de-identification, and expert adjudication. The upper branch illustrates SkinCAP, which provides dermatologist-authored, observation-first medical captions for 4,000 dermatology images, produced via multi-center annotation, cross-validation, and bilingual quality control. The lower branch depicts SkinCoT, a collection of clinician-verified, hierarchical CoT diagnostic narratives paired with 3,041 images, created through structured multi-level reasoning and certification. Both branches adopt a shared schema and identifiers to ensure interoperability and consistent downstream use.
  • Figure 2: a) Distribution of samples for each type of skin disease in SkinCAP with sample size ≥ 20. b) Five randomly selected examples from the SkinCAP dataset. c) Distribution of samples across the Fitzpatrick Scale in SkinCAP. d) Illustration of the response of SkinGPT-4 fine-tuned with SkinCAP on a case of acne.
  • Figure 3: Clinician Evaluation Interface for SkinCoT. Schematic illustration of the interface used for blinded expert scoring of CoT diagnostic narratives. The layout presents (from left to right): the ground-truth diagnosis label, the case image, the CoT reasoning text, an optional remarks box, and six single-choice A–E rating panels (Accuracy, Safety, Medical Groundedness, Clinical Coverage, Reasoning Coherence, Description Precision). This interface enables standardized clinician evaluation and supports subsequent adjudication.
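
The A–E ratings described in the Figure 3 caption could be aggregated into per-metric means and a grand mean along the following lines. This is a minimal sketch, not the authors' procedure: the letter-to-score mapping (A = 5 down to E = 1), the definition of the grand mean as the mean of the six per-metric means, the definition of a "perfect" case as one rated A on all six criteria, and the function names `score_case` and `summarize` are all assumptions made for illustration. The example ratings are invented and are not taken from the dataset.

```python
from statistics import mean

# Assumed mapping from the A-E choices to a 5-point scale (A taken as best).
LETTER_TO_SCORE = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# The six criteria named in the Figure 3 caption.
METRICS = (
    "Accuracy",
    "Safety",
    "Medical Groundedness",
    "Clinical Coverage",
    "Reasoning Coherence",
    "Description Precision",
)


def score_case(ratings):
    """Convert one case's letter ratings into numeric scores, one per metric."""
    missing = [m for m in METRICS if m not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return {m: LETTER_TO_SCORE[ratings[m]] for m in METRICS}


def summarize(cases):
    """Return per-metric means, the grand mean, and the share of all-A cases."""
    scored = [score_case(c) for c in cases]
    per_metric = {m: mean(s[m] for s in scored) for m in METRICS}
    # Assumption: grand mean = unweighted mean of the six per-metric means.
    grand_mean = mean(per_metric.values())
    # Assumption: a case is "perfect" only if every criterion received an A.
    perfect_share = sum(all(v == 5 for v in s.values()) for s in scored) / len(scored)
    return per_metric, grand_mean, perfect_share


if __name__ == "__main__":
    # Two invented cases: one rated A throughout, one with a single B.
    all_a = {m: "A" for m in METRICS}
    one_b = {**all_a, "Clinical Coverage": "B"}
    per_metric, grand_mean, perfect_share = summarize([all_a, one_b])
    print(per_metric)
    print(round(grand_mean, 4), perfect_share)
```

Because every metric is scored on every case, the mean of per-metric means equals the mean over all individual ratings. If some ratings were missing, the two definitions would diverge and the choice would need to be stated explicitly.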