Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology

Siyuan Yan, Ming Hu, Yiwen Jiang, Xieji Li, Hao Fei, Philipp Tschandl, Harald Kittler, Zongyuan Ge

TL;DR

Derm1M is the first large-scale vision-language dermatology dataset, spanning over 1M image-text pairs, 390 skin conditions, and 130 clinical concepts, all organized around an expert-curated ontology. The authors pretrain DermLIP, a family of CLIP-like models, and demonstrate substantial improvements over state-of-the-art biomedical and general VL models in zero-shot classification, few-shot/full-shot learning, and cross-modal retrieval. The work highlights robust data curation from diverse educational sources and ontology-driven knowledge augmentation to support clinically meaningful multimodal learning. By enabling hierarchical disease reasoning, rich contextual descriptions, and artifact/concept identification, Derm1M aims to accelerate practical AI tools for dermatology across education and clinical decision support.

Abstract

The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M's potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifact concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be publicly available at https://github.com/SiyuanYan1/Derm1M upon acceptance.

Paper Structure

This paper contains 24 sections, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Overview of the Derm1M dataset, the first large-scale vision-language (VL) dataset for dermatology, comprising 1,029,761 skin image-text pairs. a) Diverse sources: Distribution across five types, including YouTube videos, medical forums, PubMed articles, public datasets, and educational materials. b) Comprehensive disease coverage: Frequency of the top 15 skin conditions out of the total 390 conditions, representing the complex and diverse range of skin diseases encountered in clinical practice. c) Rich text descriptions: Distribution of text lengths across image-text pairs, with a mean length of 41. d) Rich contextual information: Word cloud of common dermatological terms in Derm1M. e) Structured domain knowledge: Expert-developed ontology that conceptually organizes domain knowledge by defining standard entities and their hierarchical disease relationships. Overall, Derm1M advances dermatology AI through unprecedented scale (1M+ image-text pairs, 257× larger than existing VL datasets in dermatology [zhou2024skincap]), comprehensive coverage (390 skin conditions with 130 clinical concepts), and rich clinical information supporting multi-granular learning aligned with clinical practices.
  • Figure 2: The overview of the five-stage process for Derm1M dataset construction: (1) Multi-source data collection, (2) data preprocessing, (3) image-text cleaning, (4) image-text pairing, and (5) knowledge-enhanced text augmentation.
  • Figure 3: Comparison of DermLIP and MONET [monet] on artifact concept detection in the ISIC dataset [isic]. Shown are the top 10 images with the highest scores for common dermatological artifacts (ruler, nail, hair, markers). Red boxes highlight incorrect concept annotations.
  • Figure 4: Benchmarking on zero-shot clinical concept identification for SkinCon and Derm7pt datasets.
  • Figure 5: Curation pipeline for YouTube content. Our process begins with searching and collecting 51k videos from educational channels, followed by filtering to identify narrative-style content with high-quality explanations. We then extract and denoise text using a combination of speech-to-text models, handcrafted algorithms, and LLMs. Finally, we align the processed text with corresponding image pairs to create a curated dataset.
  • ...and 12 more figures
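
The "top 10 images with the highest scores" in Figure 3 suggests a ranking of images by how well they match a concept prompt. The following is a minimal sketch of how such a ranking is commonly computed with CLIP-like models, assuming cosine similarity between image and text embeddings; the paper summary above does not state the exact scoring rule. The function names, the toy 3-dimensional embeddings, and the image IDs are all hypothetical and are not taken from the Derm1M code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_images(image_embeddings, concept_embedding, k=10):
    """Return the k (image_id, score) pairs most similar to the concept."""
    scored = [
        (image_id, cosine(vec, concept_embedding))
        for image_id, vec in image_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-made embeddings (assumption: a real model would produce
# high-dimensional vectors from an image encoder and a text encoder).
image_embeddings = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.1, 0.9, 0.1],
    "img_c": [0.7, 0.3, 0.1],
}
ruler_prompt_embedding = [1.0, 0.0, 0.0]  # stands in for a "ruler" text prompt

ranking = top_k_images(image_embeddings, ruler_prompt_embedding, k=2)
print([image_id for image_id, _ in ranking])
```

In practice the embeddings would come from the pretrained image and text encoders, and the same ranking routine would be run once per artifact concept (ruler, nail, hair, markers).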