A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

Siyuan Yan; Xieji Li; Dan Mo; Philipp Tschandl; Yiwen Jiang; Zhonghua Wang; Ming Hu; Lie Ju; Cristina Vico-Alonso; Yizhen Zheng; Jiahe Liu; Juexiao Zhou; Camilla Chello; Jen G. Cheung; Julien Anriot; Luc Thomas; Clare Primiero; Gin Tan; Aik Beng Ng; Simon See; Xiaoying Tang; Albert Ip; Xiaoyang Liao; Adrian Bowling; Martin Haskett; Shuang Zhao; Monika Janda; H. Peter Soyer; Victoria Mar; Harald Kittler; Zongyuan Ge

TL;DR

DermFM-Zero is a dermatology vision-language foundation model trained on over 4 million multimodal data points using masked latent modelling and bootstrapped contrastive learning. It achieves state-of-the-art zero-shot performance across diverse benchmarks and provides robust, task-agnostic clinical support without fine-tuning. Three multinational reader studies show substantial improvements in primary care differential diagnosis and specialist multimodal skin cancer assessment, with a notable skill-leveling effect in collaborative workflows. The model's latent representations are also interpretable: sparse autoencoders automatically discover clinically meaningful concepts and allow targeted suppression of artifact-induced biases, contributing to safer and more robust clinical decision support in dermatology.

Abstract

Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders disentangle clinically meaningful concepts without supervision, outperform predefined-vocabulary approaches, and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
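As a rough illustration of what "zero-shot diagnosis" usually means for a contrastively trained vision-language model, the sketch below ranks candidate conditions by cosine similarity between an image embedding and one text embedding per condition. This is the common CLIP-style recipe and is assumed here, not taken from the paper; the function names (`cosine`, `zero_shot_rank`) and the three-dimensional toy embeddings are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_rank(image_emb, class_text_embs):
    """Rank class labels by similarity of their text embedding to the image embedding."""
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (hypothetical): in practice these would come from the
# model's text encoder (one prompt per condition) and image encoder.
class_text_embs = {
    "melanoma": [0.9, 0.1, 0.0],
    "nevus": [0.1, 0.9, 0.0],
    "basal cell carcinoma": [0.0, 0.2, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

ranking = zero_shot_rank(image_emb, class_text_embs)
print(ranking)
```

No classifier weights are trained in this scheme: adding a new condition only requires a new text prompt, which is why no task-specific fine-tuning is needed.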

Paper Structure (3 sections, 2 equations, 15 figures, 20 tables)

Reader study 1: Human-AI collaboration for general dermatology conditions in primary care.
Reader study 2A: Expert-level benchmarking in specialist care.
Reader study 2B: Multimodal human-AI collaboration in specialist care.

Figures (15)

Figure 1: DermFM-Zero pretraining dataset and evaluation framework. a, b, The vision-language pretraining dataset. a, Pretraining data statistics, showing: (i) top skin conditions, (ii) clinical concepts, (iii) text length distribution, and (iv) common corpus terms. b, Image-text data sources: curated public (Derm1M, n=240,246; Edu, n=199,600) and private (MoleMap, n=574,239) collections. c, DermFM-Zero pretraining schematic, using multimodal data (dermoscopy, clinical, mobile photos) and text (demographics, medical history, symptoms) with unimodal self-supervised and multimodal contrastive learning objectives. d, The three-stage evaluation framework. Eval 1: Evaluation on 17 benchmarks (e.g., zero/few-shot classification, cross-modal retrieval, VQA). Eval 2: Clinical validation via three zero-shot reader studies: (I) Human-AI collaboration in primary care (n=30 PCPs vs. PCPs + DermFM-Zero in skin condition differential diagnosis). (II) Standalone AI vs. 1,073 clinicians for multimodal skin cancer diagnosis. (III) Human-AI collaboration in specialty care for skin cancer diagnosis and management (n=34). Eval 3: Automated concept discovery for transparent AI applications using SAE. All icons are from Flaticon.com.
Figure 1: Label-efficient generalization via linear probing. a, c, Label-efficient generalization performance of DermFM-Zero at various training data percentages across diverse tasks. Performance is compared against other medical foundation models (a) and a significantly larger natural-domain model (DINOv3) (c). b, Average performance improvement (from a) of DermFM-Zero over the second-best model (PanDerm) across different training data ratios. d, Performance versus parameter size, comparing DermFM-Zero (304M parameters) with DINOv3, which is 23$\times$ larger (7B parameters). In a, c, shaded bands indicate 95% CIs (centre line for the mean), computed via non-parametric bootstrapping (1,000 replicates). Pairwise significance was determined by a two-sided t-test (*P < 0.05, **P < 0.01, ***P < 0.001).
Figure 2: Zero-shot benchmark evaluation. a, Zero-shot image classification performance of DermFM-Zero and other vision-language foundation models across diverse modalities, tasks, and datasets. Metrics: AUROC for binary (c=2) and balanced accuracy for multi-class (c>2) datasets. b, Zero-shot image-to-text and text-to-image retrieval performance on the Derm1M and SkinCap datasets, measured using Recall@K (K=5, 10, 50). c, Summary ranking of models by average performance. Top: average zero-shot classification (from a). Bottom: average cross-modal retrieval (based on R@50, from b). d, t-SNE visualisation of feature embeddings from DermFM-Zero and other models for the top 20 classes of the SD-128 dataset. In a, b, bar centres represent the mean value and error bars show 95% CIs (computed via non-parametric bootstrapping, 1,000 replicates). Pairwise statistical significance was determined by a two-sided t-test (*P < 0.05, **P < 0.01, ***P < 0.001).
Figure 2: Performance evaluation under real-world multimodal settings. a, The real-world multimodal setting, integrating image modalities (clinical, dermoscopy) and structured text. b, c, Example image-text pairs from SCIN (b) and CombinMel (c), with text generated from metadata. d-f, Multimodal finetuning performance on Derm7pt (d; skin cancer diagnosis (C+D+T)), SCIN (e; skin condition classification (C+T)), and PAD (f; skin cancer diagnosis (C+T)). Left: Benchmarking DermFM-Zero against other models. Right: Modality ablation (C=clinical, D=dermoscopy, T=text). g, Modality ablation for binary and multi-class melanoma metastasis prediction (D+T). h, Kaplan-Meier curves for recurrence-free interval (RFI) in CombinMel (n=302), stratified by DermFM-Zero risk predictions. i, Forest plot of hazard ratios (HRs) for RFI, comparing DermFM-Zero with other clinical variables. j, Time-dependent ROC curves for 3-, 5-, and 7-year RFI prediction. In d-g, boxes show median and IQR; whiskers show 1.5$\times$IQR. In h, shaded areas are 95% CIs; P-value ($<0.001$) is from a log-rank test. In i, error bars are 95% CIs for HRs. In j, values in parentheses are 95% CIs for AUCs. Asterisks in d-f denote statistical significance (*P < 0.05, **P < 0.01, ***P < 0.001) from a two-sided t-test.
Figure 3: Reader study 1: Impact of DermFM-Zero zero-shot assistance on diagnostic accuracy and safety in primary care. a, b, Evaluation rubrics for diagnostic accuracy (a) and management quality (b), categorizing decisions from potential harm (score 1) to optimal care (score 4 or 5). c, Top-3 diagnostic utility, defined as the presence of the correct diagnosis within the top three differentials. d, Overall diagnostic accuracy scores comparing unaided versus AI-assisted performance. e, Global shift in management decision quality. Stacked bars show the proportion of decisions classified as Dangerous (red), Harmless (grey), and Adequate/Perfect (teal), illustrating a structural shift from potential harm to clinical competence. f, Reduction in harm rate (proportion of Inadequate/Dangerous decisions) per reader. Comparisons of unaided vs. DermFM-Zero-supported performance ($n=30$ primary care physicians) evaluating complex cases across 98 skin conditions using free-text inputs. Thick black lines connect mean values; thin grey lines represent individual readers. P-values from one-sided Wilcoxon signed-rank test. $*P < 0.05$, $**P < 0.01$.
...and 10 more figures
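
Two quantities recur in the captions above: a top-K hit (the correct diagnosis appears among the first K ranked candidates, as in the top-3 diagnostic utility of Figure 3c) and 95% CIs from non-parametric bootstrapping with 1,000 replicates. The sketch below shows one way these could be computed. It assumes a percentile bootstrap over per-case scores, which the captions do not specify, and the helper names (`top_k_hit`, `bootstrap_ci`) and the toy cases are hypothetical.

```python
import random

def top_k_hit(ranked_labels, true_label, k=3):
    """Return 1.0 if the true label is among the first k ranked labels, else 0.0."""
    return 1.0 if true_label in ranked_labels[:k] else 0.0

def bootstrap_ci(values, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant, not confirmed by the source)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample cases with replacement and record the mean of each replicate.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_replicates))
    lower = means[int((alpha / 2) * n_replicates)]
    upper = means[int((1 - alpha / 2) * n_replicates) - 1]
    return lower, upper

# Toy cases (hypothetical): each pairs a ranked differential with the true label.
cases = [
    (["a", "b", "c", "d"], "a"),
    (["b", "a", "c", "d"], "c"),
    (["d", "c", "b", "a"], "a"),
    (["c", "d", "a", "b"], "a"),
]
hits = [top_k_hit(ranked, truth, k=3) for ranked, truth in cases]
top3_rate = sum(hits) / len(hits)
ci_low, ci_high = bootstrap_ci(hits)
print(top3_rate, ci_low, ci_high)
```

Resampling whole cases keeps each replicate the same size as the original evaluation set. With only four toy cases the interval is very wide, so real use would need the full benchmark.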
