An analysis of data variation and bias in image-based dermatological datasets for machine learning classification

Francisco Filho, Emanoel Santos, Rodrigo Mota, Kelvin Cunha, Fabio Papais, Amanda Arruda, Mateus Baltazar, Camila Vieira, José Gabriel Tavares, Rafael Barros, Othon Souza, Thales Bezerra, Natalia Lopes, Érico Moutinho, Jéssica Guido, Shirley Cruz, Paulo Borba, Tsang Ing Ren

TL;DR

This paper assesses the gap between dermoscopic and clinical dermatology datasets and how data variation and bias affect machine learning classification. By evaluating multiple CNN architectures under three training schemes—FullDermoscopic, FullClinical, and FineClinic (dermoscopic pretraining with 30% clinical data)—the authors quantify domain shift impacts and demonstrate that fine-tuning on clinical data significantly improves clinical performance, while models trained on clinical data struggle on dermoscopic data. The study highlights pronounced class-imbalance effects on malignant lesions and shows cross-domain generalization remains challenging without targeted adaptation. The findings emphasize the need for domain-aware evaluation and data balancing when deploying AI in clinical dermatology to ensure reliable, equitable performance across diverse patient populations.

Abstract

AI algorithms have become valuable aids for healthcare professionals. The increasing confidence these models achieve is helpful in critical decisions. In clinical dermatology, classification models can detect malignant lesions on patients' skin using only RGB images as input. However, most learning-based methods are trained on dermoscopic datasets, which are large and validated against a gold standard. Clinical models, by contrast, must classify images from users' smartphone cameras, which lack the resolution provided by dermoscopy. Clinical applications also bring new challenges: the images can include captures from uncontrolled environments, skin tone variations, viewpoint changes, noise in data and labels, and unbalanced classes. A possible alternative is to use transfer learning to handle clinical images. However, because the number of clinical samples is low, this can degrade the model's performance, since the source distribution used in training differs from that of the test set. This work evaluates the gap between dermoscopic and clinical samples and examines how dataset variations affect training. It assesses the main differences between distributions that disturb the model's predictions. Finally, based on experiments with different architectures, we discuss how to combine data from divergent distributions so as to reduce the impact on the model's final accuracy.
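
The FineClinic scheme named in the TL;DR fine-tunes a dermoscopic-pretrained model on 30% of the clinical data. The summary does not say how that 30% is drawn, so the following is only a minimal sketch under one assumption: the subset is sampled per class (stratified), so that rare malignant classes keep their share. The function name `stratified_subset` and the toy label list are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices covering roughly `fraction` of each class.

    Assumption: per-class sampling; the paper's actual selection
    procedure is not described in this summary.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one sample per class so rare classes are not dropped.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(indices[:k])
    return sorted(chosen)


# Hypothetical, imbalanced clinical label list (illustrative only).
clinical_labels = ["benign"] * 80 + ["malignant"] * 20

# Indices used for fine-tuning; the rest stay held out for evaluation.
finetune_idx = stratified_subset(clinical_labels, fraction=0.30)
held_out_idx = sorted(set(range(len(clinical_labels))) - set(finetune_idx))
```

Sampling per class rather than over the whole pool is a design choice for this sketch: with unbalanced classes, a uniform draw of a small subset can leave a minority class with very few examples.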

Paper Structure (14 sections, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Examples of images in the dermoscopic ISIC18 dataset (top row) and PAD-UFES-20 (bottom row). While clinical features impact model decisions, it is evident how pixel differences arise from intrinsic characteristics of each domain (e.g., capture device quality, lighting, noise, resolution).