Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images

Nusrat Munia, Abdullah-Al-Zubaer Imran

TL;DR

This work tackles bias in skin-disease diagnosis by proposing DermDiT, a diffusion transformer that generates realistic, diverse dermoscopic images conditioned on prompts that Vision-Language Models produce from the images and their metadata. By operating in a latent space and using cross-attention with text embeddings, DermDiT synthesizes high-quality images for underrepresented subgroups, enabling balanced datasets. Experimental results show favorable FID and MS-SSIM scores and improved recall and F1 in downstream classification when classifiers are trained on synthetic data, suggesting reduced reliance on real data while maintaining diagnostic utility. Overall, the approach provides a practical path to mitigating diagnosis bias in dermatology through VLM-guided data augmentation with latent-diffusion methods.
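
The conditioning step mentioned above, cross-attention between image latents and text embeddings, can be illustrated with a minimal NumPy sketch. This is not the DermDiT implementation: the function name `cross_attention`, the single attention head, the random projection matrices, and all dimensions are assumptions chosen for illustration. Queries come from the latent image tokens, while keys and values come from the prompt's text embeddings, so each latent token gathers information from the prompt.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_embeddings, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, hypothetical helper).

    latent_tokens:   (n_latent, d_latent) patch tokens of the image latent
    text_embeddings: (n_text, d_text) token embeddings of the text prompt
    """
    q = latent_tokens @ w_q        # queries from the image side
    k = text_embeddings @ w_k      # keys from the text side
    v = text_embeddings @ w_v      # values from the text side
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each latent token attends over text tokens
    return weights @ v, weights

# Toy dimensions, assumed for illustration only.
rng = np.random.default_rng(0)
n_latent, n_text, d_latent, d_text, d_attn = 16, 8, 32, 24, 32
latent_tokens = rng.normal(size=(n_latent, d_latent))
text_embeddings = rng.normal(size=(n_text, d_text))
w_q = rng.normal(size=(d_latent, d_attn)) * 0.1
w_k = rng.normal(size=(d_text, d_attn)) * 0.1
w_v = rng.normal(size=(d_text, d_attn)) * 0.1

out, weights = cross_attention(latent_tokens, text_embeddings, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

In a trained model the projection matrices are learned and the block typically uses several heads plus a residual connection; the sketch keeps only the core computation.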

Abstract

Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via Vision Language Models and multimodal text-image learning to generate new dermoscopic images. We utilize large vision language models to generate an accurate prompt for each dermoscopic image; these prompts guide the generation of synthetic images that improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnoses. Our extensive experiments show that large vision language models provide much more insightful representations, which enable DermDiT to generate high-quality images. Our code is available at https://github.com/Munia03/DermDiT

Paper Structure

This paper contains 10 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Proposed DermDiT: A diffusion transformer model generates new images conditioned on text prompts. These prompts are created by a Prompt Generator, which leverages a Vision-Language Model (VLM) to produce descriptive text based on input dermoscopic images and their associated metadata.
  • Figure 2: Sample images from I(a) the real ISIC dataset and I(b-f) synthetic images generated by the diffusion models. Density distribution plots (II) and Principal Component Analysis (III) compare the real ISIC data with the synthetic data generated by (a) Unconditional LDM, (b) Unconditional DiT, (c) LDM conditioned on LLaVA-generated text prompts, (d) DiT conditioned on LLaVA-generated text prompts, and (e) DiT conditioned on LLaVA-Med-generated text prompts.
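
The kind of PCA comparison described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' analysis code: the helper name `pca_project` is hypothetical, the feature vectors are random stand-ins (the source does not say which image features were used), and PCA is computed here with a plain SVD on the pooled real and synthetic features so that both sets share the same axes.

```python
import numpy as np

def pca_project(real_feats, synth_feats, n_components=2):
    """Project real and synthetic feature vectors onto shared principal axes.

    Both inputs are (n_samples, n_features) arrays. The axes are fitted on
    the pooled data so the two projections are directly comparable.
    """
    pooled = np.vstack([real_feats, synth_feats])
    mean = pooled.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    components = vt[:n_components]
    real_2d = (real_feats - mean) @ components.T
    synth_2d = (synth_feats - mean) @ components.T
    return real_2d, synth_2d

# Random stand-in features (assumption): 200 real and 150 synthetic samples.
rng = np.random.default_rng(0)
real_feats = rng.normal(loc=0.0, size=(200, 64))
synth_feats = rng.normal(loc=0.3, size=(150, 64))

real_2d, synth_2d = pca_project(real_feats, synth_feats)
print(real_2d.shape, synth_2d.shape)
```

Plotting `real_2d` and `synth_2d` in the same scatter plot then shows how closely the synthetic samples overlap the real ones along the two dominant directions of variation.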