SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning
Yuhao Shen, Jiahe Qian, Zhangtianyi Chen, Yuanhao He, Juexiao Zhou
TL;DR
SkinGPT-R1 addresses the need for explicit, verifiable reasoning in dermatology-capable vision-language systems. It introduces DermCoT, a dermatology-centered chain-of-thought (CoT) corpus, together with DermEval and DermBench, which align and benchmark reasoning quality against clinician ratings. The architecture keeps a Vision-R1 backbone frozen and augments it with adapter-only visual distillation and a low-rank language bias, adding dermatology priors and evidence-first narratives without incurring latency penalties. SkinGPT-R1 ranks first on DermBench across six clinician-defined dimensions and yields stable zero-shot accuracy gains on three dermatology classification benchmarks; ablations confirm that DermCoT supervision and visual distillation are complementary. The work offers a practical, efficient pathway for domain-specific chain-of-thought modeling in dermatology and suggests a framework that may transfer to other image-driven medical specialties.
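The idea of a frozen backbone with a small trainable low-rank component can be illustrated with a generic LoRA-style layer. The sketch below is not the authors' implementation: the class name `LowRankAdaptedLinear`, the zero initialization of `B`, and the mean-squared-error form of `distill_loss` are all assumptions made for illustration, and the summary above does not specify how the low-rank language bias or the distillation objective is actually defined.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankAdaptedLinear:
    """Frozen weight W plus a trainable low-rank update B @ A.

    Hypothetical illustration of a generic low-rank adapter; the
    paper's own formulation may differ.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        # Frozen backbone weight: never updated during adaptation.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors. B starts at zero (an assumption),
        # so the adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + d for b, d in zip(base, delta)]

    def frozen_params(self):
        return len(self.W) * len(self.W[0])

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def distill_loss(student_features, teacher_features):
    """Mean squared error between student and teacher feature vectors
    (one common distillation objective, assumed here)."""
    n = len(student_features)
    return sum((s - t) ** 2 for s, t in zip(student_features, teacher_features)) / n


layer = LowRankAdaptedLinear(d_in=8, d_out=8, rank=2)
print(layer.frozen_params(), layer.trainable_params())
```

In a layer like this, only `A` and `B` would receive gradient updates. Because the update `B @ A` has the same shape as `W`, it can in principle be folded into `W` after training, which is one way a low-rank component can avoid extra inference cost; whether SkinGPT-R1 does this is not stated in the summary.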
Abstract
We present SkinGPT-R1, a dermatology-focused vision-language model that makes diagnostic chain-of-thought reasoning explicit, step-by-step, and verifiable. To support skin-specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain-of-thought narratives that combines 10,000 DermEval-filtered training cases with 3,000 dermatologist-scored certified cases. We define DermEval as a physician-aligned six-dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain-of-thought quality. On DermBench, across 14 general, reasoning, and medical vision-language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician-defined dimensions, ranks first among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision-language models. Ablation results further show that DermCoT-based chain-of-thought supervision provides substantial improvements over the base model, and that adding dermatology-aware visual distillation yields consistent additional gains in both narrative quality and recognition.
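To make the headline numbers concrete, the following sketch shows how a six-dimension average and a relative improvement are computed, and what baseline the reported figures imply. The function names are hypothetical, and the Vision-R1 average is not given in the abstract: the value derived here is only an estimate, since "about 41%" is a rounded figure.

```python
def dimension_average(scores):
    """Average a set of per-dimension scores (six dimensions on DermBench)."""
    if len(scores) != 6:
        raise ValueError("expected one score per clinician-defined dimension (6)")
    return sum(scores) / len(scores)


def relative_improvement(new_score, base_score):
    """Relative gain of new_score over base_score, as a fraction."""
    return (new_score - base_score) / base_score


def implied_baseline(new_score, relative_gain):
    """Baseline score implied by a new score and a stated relative gain."""
    return new_score / (1.0 + relative_gain)


skingpt_avg = 4.031   # reported DermBench average (out of 5)
approx_gain = 0.41    # "about 41%" over Vision-R1, a rounded figure

# Estimated Vision-R1 average; an inference, not a reported number.
vision_r1_estimate = implied_baseline(skingpt_avg, approx_gain)
print(round(vision_r1_estimate, 2))  # prints 2.86
```

Under these assumptions the implied Vision-R1 average is roughly 2.86 out of 5, so the gain corresponds to a bit more than one point on the five-point scale.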
