Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback

Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm

TL;DR

This work addresses data scarcity in dermatology by enabling medically accurate skin-disease image generation through MAGIC, a semi-automated framework that leverages AI-expert collaboration. MAGIC uses expert-crafted clinical checklists evaluated by Multimodal LLMs to guide diffusion-model fine-tuning via two routes, RFT and DPO, and incorporates an Image-to-Image module to accelerate sampling while preserving anatomical context. Empirical results show substantial improvements in clinical fidelity (higher dermatologist-aligned scores, lower FID) and downstream diagnostic accuracy, including a +9.02 percentage-point gain for ResNet18 and a +5.12-point gain for DINOv2 on a 20-condition task, with pronounced benefits in few-shot scenarios. The approach reduces expert labeling workload, remains model-agnostic, and highlights a scalable path for applying foundation-model feedback to specialized medical imaging tasks.

Abstract

The paucity of medical data severely limits the generalizability of diagnostic ML models, as a small clinical dataset cannot represent the full spectrum of disease variability. To address this, diffusion models (DMs) are considered a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, which degrades the performance of models trained on them. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals strong visual reasoning capabilities, making them well-suited candidates for evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method translates expert-defined criteria into actionable feedback for DM image synthesis, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.

Paper Structure

This paper contains 32 sections, 5 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Illustration of our proposed MAGIC: (a) A preliminary fine-tuned diffusion model (DM) transforms a source image (e.g., sarcoidosis) to a target condition (e.g., lupus erythematosus); an MLLM then provides expert checklist-based feedback scores on the generated image pair. (b) This feedback guides the subsequent fine-tuning (e.g., RFT or DPO) of the DM. (c) The feedback-enhanced DM synthesizes medically accurate dermatological images for robust classifier training.
  • Figure 2: Evolution of synthetic skin conditions generated by MAGIC-DPO, illustrating its ability to learn unique visual features from feedback across training iterations. The top row shows the model transforming Sarcoidosis (SAR) into Erythema Multiforme (ERY), learning features such as target (bull's-eye) lesions with concentric rings. The middle row shows the model transforming Allergic Contact Dermatitis (ALL) into Lupus Erythematosus (LUP), progressively developing a butterfly rash covering the cheeks. The bottom row shows the model transforming Granuloma Annulare (GRA) into Vitiligo (VIT), evolving to show characteristic depigmented patches.
  • Figure 3: Illustration of the image assessment process by OpenAI's GPT-4o using condition-specific checklists for target skin conditions such as lupus erythematosus, granuloma annulare, and vitiligo. Each generated image in a pair is evaluated against five clinical criteria. The image with more satisfied criteria is considered the preferred sample in a comparison. Additional examples are provided in the appendix.
  • Figure 4: Experimental results showing (a) the impact of ratio $\rho$, (b) feedback volume on accuracy, (c) FID score comparison across different methods, and (d) evaluation results on synthetic data showing the percentage of criteria met. Our method consistently outperforms baseline methods in most metrics, achieving lower FID scores and higher criteria satisfaction rates.
  • Figure 5: Distribution of alignment scores indicating the number of checklist criteria met in the description.
  • ...and 3 more figures
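
The pairwise comparison described in the Figure 3 caption (five checklist criteria per image, with the image satisfying more criteria preferred) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `ScoredImage` and `preferred_pair` are hypothetical, the MLLM call that produces the per-criterion verdicts is omitted, and skipping tied pairs is an assumption that the caption does not address.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Per the Figure 3 caption: each image is checked against five clinical criteria.
NUM_CRITERIA = 5


@dataclass(frozen=True)
class ScoredImage:
    """Hypothetical container for one generated image and its checklist verdicts."""
    image_id: str
    criteria_met: Tuple[bool, ...]  # one verdict per checklist criterion

    @property
    def score(self) -> int:
        # Number of checklist criteria the evaluator marked as satisfied.
        return sum(self.criteria_met)


def preferred_pair(
    a: ScoredImage, b: ScoredImage
) -> Optional[Tuple[ScoredImage, ScoredImage]]:
    """Return (preferred, rejected), or None when the two scores tie.

    Assumption: tied pairs carry no preference signal and are skipped.
    """
    for img in (a, b):
        if len(img.criteria_met) != NUM_CRITERIA:
            raise ValueError(f"{img.image_id}: expected {NUM_CRITERIA} verdicts")
    if a.score == b.score:
        return None
    return (a, b) if a.score > b.score else (b, a)


if __name__ == "__main__":
    first = ScoredImage("img_0", (True, True, True, True, False))
    second = ScoredImage("img_1", (True, False, True, False, False))
    pair = preferred_pair(first, second)
    if pair is not None:
        winner, loser = pair
        print(f"preferred={winner.image_id} rejected={loser.image_id}")
```

Pairs of this (preferred, rejected) form are the kind of input a preference-based objective such as DPO consumes; how the resulting scores are used in the RFT route is not specified in this section and is left out of the sketch.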