Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis

Ko Watanabe, Stanislav Frolov, Aya Hassan, David Dembinsky, Adriano Lucieri, Andreas Dengel

TL;DR

The paper tackles fairness auditing for AI-based skin lesion classifiers by developing a diffusion-based, attribute-controlled synthetic data generator (LightningDiT) to create demographically balanced dermoscopic cohorts. It demonstrates that synthetic cohorts reproduce bias patterns observed with real data across sex, age, and skin type while enabling controlled, privacy-preserving fairness evaluations on three pretrained melanoma classifiers. The study highlights both the promise and limitations of synthetic data for fairness testing, including potential dataset-shift effects and the need for quality control and prospective validation. Overall, the approach offers a practical workflow for systematic fairness audits in medical imaging and suggests paths to extend to multi-class diagnoses and fairness-driven model training.

Abstract

Recent advances in deep learning and on-device inference could transform routine screening for skin cancers. Along with the anticipated benefits of this technology, potential dangers arise from unforeseen and inherent biases. A significant obstacle is building evaluation datasets that accurately reflect key demographics, including sex, age, and race, as well as other underrepresented groups. To address this, we train a state-of-the-art generative model to generate synthetic data in a controllable manner to assess the fairness of publicly available skin cancer classifiers. To evaluate whether synthetic images can be used as a fairness testing dataset, we prepare a real-image dataset (MILK10K) as a benchmark and compare the True Positive Rate (TPR) of three models (DeepGuide, MelaNet, and SkinLesionDensnet). The classification tendencies of each model on real and generated images showed similar patterns across the different attribute subsets. We confirm that highly realistic synthetic images facilitate model fairness verification.
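
The comparison described in the abstract rests on computing the True Positive Rate separately for each demographic group. The following is a minimal sketch of that computation on toy labels, not the authors' evaluation code. The function names `tpr_by_group` and `tpr_gap` are hypothetical, and the max-min gap is only one common way to summarize a group disparity; the excerpt does not state which summary the paper uses.

```python
from collections import defaultdict


def tpr_by_group(y_true, y_pred, groups):
    """Return {group: TPR}, where TPR = TP / (TP + FN).

    Only melanoma-positive samples (y_true == 1) contribute;
    groups without any positive sample are omitted.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                true_pos[group] += 1
    return {g: true_pos[g] / positives[g] for g in positives}


def tpr_gap(rates):
    """Assumed disparity summary: largest minus smallest group TPR."""
    return max(rates.values()) - min(rates.values())


# Toy data (illustrative only, not from the paper).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
sex = ["female", "female", "female", "male", "male", "male", "female", "male"]

rates = tpr_by_group(y_true, y_pred, sex)
gap = tpr_gap(rates)
print(rates, gap)
```

Running the same function once on real images and once on synthetic images, for the same classifier and attribute, gives the two sets of group rates whose patterns can then be compared.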

Paper Structure

This paper contains 20 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Real melanoma test data (left), such as MILK10K [MILK10k_2025], shows limited and imbalanced demographic coverage, while synthetic data (right) can provide complete combinations with balanced sampling, enabling more reliable fairness assessment.
  • Figure 2: Overall pipeline of our fairness testing workflow. We first train a generative model based on LightningDiT [yao2025reconstruction] using the dataset. Synthetic melanoma test images are then generated via systematic prompt templates. Finally, the synthetic images are used to assess fairness, measured via differences in True Positive Rate (TPR) for pre-trained melanoma detection models.
  • Figure 3: Synthetic melanoma images generated by our model. Rows represent Fitzpatrick skin types (I–VI) combined with sex, and columns represent age groups (10–80). The grid demonstrates coverage of diverse demographic groups for fairness assessment.
  • Figure 4: True Positive Rate (TPR) of the skin lesion classifiers across different demographic groups (sex, age, and Fitzpatrick skin type). The results show that testing on real and synthetic images yields a similar bias trend for each of DeepGuide, MelaNet, and SkinLesionDensnet.
  • Figure 5: Cumulative TPR curves for the PII attributes on MILK10K (real) and our synthetic images using MelaNet. The results show that the TPR converges to similar levels, particularly for sex and skin type, indicating that synthetic images can serve as a suitable fairness evaluation dataset.
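
The balanced coverage described for Figures 1 to 3 amounts to enumerating every combination of skin type, sex, and age group and sampling each cell equally often. The sketch below illustrates that enumeration only. The prompt wording in `TEMPLATE`, the two sex labels, and the ten-year age step are assumptions made for illustration; the paper's actual prompt templates are not given in this excerpt.

```python
from itertools import product

# Attribute values. Skin types and the 10-80 age range follow the Figure 3
# caption; the sex labels and the 10-year step are assumptions.
SKIN_TYPES = ["I", "II", "III", "IV", "V", "VI"]
SEXES = ["female", "male"]
AGES = list(range(10, 90, 10))  # 10, 20, ..., 80

# Hypothetical prompt wording, not the authors' template.
TEMPLATE = "dermoscopic image of melanoma, Fitzpatrick skin type {skin}, {sex}, age {age}"


def build_prompts(n_per_cell=1):
    """Return one prompt per requested image, covering every attribute
    combination exactly n_per_cell times (balanced sampling)."""
    prompts = []
    for skin, sex, age in product(SKIN_TYPES, SEXES, AGES):
        prompt = TEMPLATE.format(skin=skin, sex=sex, age=age)
        prompts.extend([prompt] * n_per_cell)
    return prompts


prompts = build_prompts(n_per_cell=1)
print(len(prompts), prompts[0])
```

Because every cell receives the same number of prompts, group sizes in the resulting synthetic test set are equal by construction, which is the property a real cohort with sparse demographic coverage lacks.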