SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

Abstract

While recent advances in Large Language Models (LLMs) have significantly improved dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and with rare skin diseases owing to sparse training data, and they lack the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks concentrate primarily on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, where it achieves a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet over the strongest baseline. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the Rare Skin Disease Dataset (RSDD), the first benchmark to address the scarcity of clinical data on rare skin diseases, which contains 564 clinical samples spanning eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.
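The accuracy, weighted F1, and Cohen's Kappa figures quoted above are presumably computed with the standard definitions; the evaluation code itself is not shown here. As a minimal sketch under that assumption, the three metrics can be written in plain Python as follows. The function names and the toy labels are illustrative only and do not come from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies per class.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1:
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels (hypothetical, for illustration only).
y_true = ["psoriasis", "psoriasis", "eczema", "eczema", "melanoma", "melanoma"]
y_pred = ["psoriasis", "eczema", "eczema", "eczema", "melanoma", "psoriasis"]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

On this toy input the accuracy is 4/6 while Cohen's Kappa is 0.5, which illustrates why the paper reports Kappa alongside accuracy: Kappa discounts the agreement that would be expected by chance given the label frequencies.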

Paper Structure

This paper contains 21 sections, 13 equations, 5 figures, 2 algorithms.

Figures (5)

  • Figure 1: a, How clinicians formulate a comprehensive Diagnosis Report. Clinicians formulate a Diagnosis Report by integrating three core pillars of medical intelligence: Personal Clinical Experience, Standard Medical Literature, and the Recall of Similar Historical Cases. This holistic synthesis, further augmented by Online Verification of the latest treatments, enables a validated Treatment Plan. b, The Architecture of the SkinGPT-X System. Upon receiving a disease image and a query, the Vision Agent extracts fine-grained Visual Findings, while the Diagnosis Agent generates initial Pre-diagnosis Hypotheses. This process is anchored by the RAG module, which retrieves Local Medical Knowledge to ensure evidence-based reasoning. To incorporate empirical experience, the system uses the self-evolving agent memory, from which the Top-5 Similar Cases and Diagnostic Guidelines are retrieved via the Historical Case Graph Database. The visual features, retrieved guidelines, and historical precedents are synthesized by the Case-Review Agent, which conducts a rigorous cross-reference to produce a validated Report. The Summarize Agent keeps the agent memory self-evolving by integrating newly confirmed cases into the Dynamic Repository and iteratively distilling the diagnostic guidelines without retraining.
  • Figure 2: a, Performance comparison of SkinGPT-X versus four state-of-the-art models (MedGemma, Hulu-Med, Qwen3-VL, and PanDerm) across four benchmark skin disease datasets. $c$ indicates the class number; $n$ represents the data size. Metrics include ACC, Weighted F1, MCC, and Cohen’s Kappa. The error bars represent 95% CIs, and $P$ values were calculated using a two-sided $t$-test to indicate statistical significance between SkinGPT-X and the second-best model. b, Case study of diagnostic case reviewing and visual findings generation. The panels display the text-based outputs from MedGemma, Hulu-Med, and SkinGPT-X for representative clinical cases. The results illustrate that SkinGPT-X provides more reliable and transparent diagnostics by reviewing current clinical findings with its self-evolving agent memory.
  • Figure 3: a, Data distribution and hierarchical composition of the dataset series. This figure illustrates the incremental expansion of skin disease categories within the Dermnet dataset, scaling from Dermnet to the comprehensive Dermnet498. Each colored block represents a unique disease class. b, Comparative performance metrics on the Dermnet498 dataset. The bar chart presents ACC, MCC, Weighted F1, and Kappa scores for MedGemma, Hulu-Med, Qwen3-VL, PanDerm, and SkinGPT-X. Numerical values and error bars represent the mean and confidence intervals, with statistical significance between the top two models indicated by $P$-values. The radar chart displays the composition of Dermnet498. The legend specifies the top categories and their respective sample counts. c, Performance trends across varying category scales. The line graphs plot the changes in ACC, MCC, Weighted F1, and Kappa for the PanDerm and SkinGPT-X models as the number of disease categories increases.
  • Figure 4: a, Construction pipeline of the Rare Skin Disease Dataset (RSDD). The dataset integrates diverse data sources, including medical libraries, clinical centers, and specialized guidelines, to ensure high-quality labels for rare dermatological conditions. The RSDD comprises 8 rare conditions, including Cutaneous Neuroendocrine Carcinoma ($n=115$), Generalized Pustular Psoriasis ($n=110$), and Behcet's Disease ($n=81$), among others. b, Multi-dimensional performance comparison on the RSDD ($n=564$). The radar chart illustrates the diagnostic efficacy of SkinGPT-X versus MedGemma, Hulu-Med, Qwen3-VL, and PanDerm across five metrics: ACC, MCC, Macro F1, Weighted F1, and Kappa. c, Representative case studies of rare disease diagnosis and differential reasoning. EvoDerma-Mem enhances the system's ability to distinguish the correct diagnosis from other rare candidate skin diseases.
  • Figure 5: a, Performance comparison between SkinGPT-X and the baseline framework (without memory) on the Fitzpatrick-17k and Dermnet498 datasets. Data are presented as mean metrics with 95% confidence intervals; statistical significance was determined via two-sided $t$-tests ($p < 0.001$). b, Spatiotemporal evolution trajectory of diagnostic guidelines. The bubble color intensity represents the degree of knowledge refinement, where deeper shades indicate more substantial updates to the diagnostic guidelines. c, The standardized Physician Evaluation Form used for blinded clinical review, focusing on Rigorousness of Medical Logic, Validity of Diagnostic Guidelines, and Completeness and Rationality of Clinical Manifestation Refinement. d, Results of physician validation across three critical dimensions.
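
The self-evolving memory described in the Figure 1 caption (retrieve the Top-5 similar cases, then fold newly confirmed cases back into the repository without retraining) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `Case` and `CaseMemory` names, the use of cosine similarity over embedding vectors, the flat in-memory list standing in for the Historical Case Graph Database, and the simple note-appending step standing in for the guideline distillation that the Summarize Agent performs. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Case:
    embedding: List[float]  # hypothetical feature vector of the visual findings
    diagnosis: str          # confirmed diagnosis label
    findings: str           # short text summary of the visual findings


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class CaseMemory:
    cases: List[Case] = field(default_factory=list)
    guidelines: Dict[str, List[str]] = field(default_factory=dict)

    def retrieve(self, query_embedding: List[float], k: int = 5) -> List[Case]:
        """Return up to k stored cases most similar to the query embedding."""
        ranked = sorted(
            self.cases,
            key=lambda c: cosine(query_embedding, c.embedding),
            reverse=True,
        )
        return ranked[:k]

    def add_confirmed_case(self, case: Case) -> None:
        """Store a confirmed case and update that diagnosis's guideline notes.

        Stand-in for guideline distillation: new findings are appended as
        notes; no model weights are touched.
        """
        self.cases.append(case)
        notes = self.guidelines.setdefault(case.diagnosis, [])
        if case.findings not in notes:
            notes.append(case.findings)


# Toy usage with made-up two-dimensional embeddings.
memory = CaseMemory()
memory.add_confirmed_case(Case([1.0, 0.0], "psoriasis", "silvery scaly plaques"))
memory.add_confirmed_case(Case([0.0, 1.0], "melanoma", "asymmetric pigmented lesion"))
top = memory.retrieve([0.9, 0.1], k=1)
```

In this sketch the memory grows only through `add_confirmed_case`, so retrieval quality changes as cases accumulate while the underlying models stay fixed, which is the property the caption attributes to the Dynamic Repository.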