Are generative models fair? A study of racial bias in dermatological image generation

Miguel López-Pérez, Søren Hauberg, Aasa Feragen

TL;DR

This paper investigates whether generative models used in dermatology are fair across racial groups by training a Variational Autoencoder with perceptual loss on the Fitzpatrick17k dataset and evaluating reconstruction performance across lighter and darker skin tones. The study demonstrates that subgroup representation strongly influences reconstruction quality, with a consistent bias favoring lighter skin even when training is balanced, and reveals that the model’s latent uncertainty fails to serve as a reliable flag for underrepresentation. The work highlights critical gaps in dataset representativeness and uncertainty quantification, underscoring the need for better data and fairness-aware evaluation to enable trustworthy AI tools in healthcare. It also points to the potential role of disease distribution differences across skin tones in contributing to observed disparities and advocates for future research on disentangling skin-tone effects from pathology signals.

Abstract

Racial bias in medicine, such as in dermatology, presents significant ethical and clinical challenges. A likely cause is the significant underrepresentation of darker skin tones in training datasets for machine learning models. While efforts to address bias in dermatology have focused on improving dataset diversity and mitigating disparities in discriminative models, the impact of racial bias on generative models remains underexplored. Generative models, such as Variational Autoencoders (VAEs), are increasingly used in healthcare applications, yet their fairness across diverse skin tones is currently not well understood. In this study, we evaluate the fairness of generative models in clinical dermatology with respect to racial bias. For this purpose, we first train a VAE with a perceptual loss to generate and reconstruct high-quality skin images across different skin tones. We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models. Our findings indicate that VAE performance is, as expected, influenced by representation, i.e., increased representation of a skin tone comes with increased performance on that skin tone. However, we also observe, even independently of representation, that the VAE performs better for lighter skin tones. Additionally, the uncertainty estimates produced by the VAE are ineffective in assessing the model's fairness. These results highlight the need for more representative dermatological datasets, for a better understanding of the sources of bias in such models, and for improved uncertainty quantification mechanisms to detect and address racial bias in generative models for trustworthy healthcare technologies.
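The subgroup comparison described in the abstract can be illustrated with a minimal sketch that averages per-image reconstruction MSE within each skin-tone group and reports the gap between groups. This is not the authors' evaluation code: the function names `image_mse` and `groupwise_mse`, the group labels, and the toy pixel values are all hypothetical, and images are represented as flat lists of pixel intensities purely to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean


def image_mse(original, reconstruction):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(original) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    return mean((o - r) ** 2 for o, r in zip(original, reconstruction))


def groupwise_mse(originals, reconstructions, groups):
    """Average the per-image MSE within each skin-tone group."""
    errors = defaultdict(list)
    for original, reconstruction, group in zip(originals, reconstructions, groups):
        errors[group].append(image_mse(original, reconstruction))
    return {group: mean(values) for group, values in errors.items()}


# Toy data (hypothetical): four two-pixel "images" and their reconstructions.
originals = [[0.2, 0.4], [0.6, 0.8], [0.1, 0.3], [0.5, 0.7]]
reconstructions = [[0.2, 0.4], [0.6, 0.6], [0.3, 0.3], [0.5, 0.3]]
groups = ["light", "light", "dark", "dark"]

per_group = groupwise_mse(originals, reconstructions, groups)
# A positive gap means reconstructions are worse for the "dark" group.
gap = per_group["dark"] - per_group["light"]
print(per_group, gap)
```

In practice the reconstructions would come from the trained VAE's decoder, and the same grouping could be applied to any per-image score, such as a likelihood or a perceptual distance.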

Paper Structure (24 sections, 2 equations, 8 figures)

Figures (8)

  • Figure 1: Example predictions from our VAE model trained on a balanced subset of the Fitzpatrick17k dataset, comprising 50/50 light and dark skin tones. Notably, predictions for lighter tones are more accurate and better preserve the lesion compared to those for darker tones.
  • Figure 2: Distribution of samples by Fitzpatrick skin type (FST) in the Fitzpatrick17k dataset (Groh et al., 2021). The label '-1' represents missing values.
  • Figure 3: Likelihood or Mean Square Error (MSE) of the VAEs in the test sets.
  • Figure 4: Example reconstruction of lighter skin tones. Reconstructions are produced using three training set configurations: 'A', 'B', and 'C'.
  • Figure 5: Example reconstruction of darker skin tones. Reconstructions are produced using three training set configurations: 'A', 'B', and 'C'.
  • ...and 3 more figures