Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets

Kumar Abhishek, Aditi Jain, Ghassan Hamarneh

TL;DR

This work addresses critical data quality problems in three large dermatology image datasets by systematically identifying data leakage, duplicates, mislabeled images, and nonstandard partitions. It introduces corrected and extended datasets—DermaMNIST-C, DermaMNIST-E, and Fitzpatrick17k-C—along with standardized holdout partitions to enable fair, reproducible benchmarking. The authors combine automated duplicate detection with manual verification, leverage embedding-based similarity, and map labels to ICD-11/SNOMED-CT to reveal labeling inconsistencies, removing problematic images and clusters. Through redesigned benchmarks and public code, the study highlights how data quality directly affects performance claims and provides a concrete path toward more robust evaluation in dermatology AI. The work emphasizes reproducibility and dataset curation as foundational to trustworthy AI deployment in clinical contexts.

Abstract

The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure their effects on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.

Paper Structure (41 sections, 17 figures, 4 tables)

This paper contains 41 sections, 17 figures, 4 tables.

Figures (17)

  • Figure 1: DermaMNIST analyses: (a, b) show instances of and reasons for the data leakage, and (c) visualizes how the three datasets (DermaMNIST, DermaMNIST-C, and DermaMNIST-E) differ in their partition composition, yet have similarly proportionate diagnosis distributions. Images from DermaMNIST are licensed under CC BY-NC 4.0 (yang2021medmnist; yang2023medmnist). Best viewed online.
  • Figure 2: Visualizing how DermaMNIST's incorrect resizing operation leads to loss of information. DermaMNIST's approach (top row) to generating $224 \times 224$ images results in visibly pixelated images. Our approach (bottom row), used for both DermaMNIST-C and DermaMNIST-E, retains much more detailed information. Images from DermaMNIST are licensed under CC BY-NC 4.0 (yang2021medmnist; yang2023medmnist). Best viewed online.
  • Figure 3: Visualizing the four scenarios that a pair of images from HAM10000 can be assigned to in duplicate detection, based on the metadata and the fastdup-based duplicate detection followed by manual review. "Confirmed duplicates", as the name suggests, are pairs that are images of the same lesion, indicated by the same lesion IDs in the metadata. Similarly, "True non-duplicates" are pairs of images that belong to different lesions. "Missed duplicates" refer to image pairs that have differing lesion IDs according to the metadata, but their high visual similarity (measured by cosine similarity of their image embeddings) followed by manual review confirms that these are indeed images of the same lesion, and were therefore 'missed' by the metadata. Finally, "False duplicates" refer to pairs where images share the same lesion IDs but do not belong to the same lesion. In our analysis, we did not find any instances of "False duplicates" in HAM10000. For all these sample images, the image IDs and the lesion IDs are along the horizontal and the vertical axes, respectively. Images from HAM10000 are licensed under CC BY-NC 4.0 (tschandl2018ham10000).
  • Figure 4: Analysis of the top 1,000 most similar pairs in HAM10000 detected by fastdup: in intervals of 100 pairs, we count how many of these 100 purported duplicate image pairs are not already recorded as duplicates in the HAM10000 metadata (i.e., the two images have different lesion IDs), and manually review those to determine which are "Missed duplicates" (i.e., pairs where the two images have different lesion IDs, but are actually images of the same lesion; Fig. 3) and which are "True non-duplicates" (i.e., pairs where the two images have different lesion IDs and are indeed images of different lesions; Fig. 3). For example, looking at the 301--400 range, we find that from the 301st to the 400th most similar image pairs detected by fastdup, 44 pairs contained images that did not belong to the same lesion ID according to the HAM10000 metadata. Of these 44 pairs, manual inspection revealed 3 pairs to be newly discovered duplicates ("Missed duplicates"), whereas the remaining 41 pairs were images of different lesions and were therefore false positives of the detector ("True non-duplicates"). In total, 18 such newly confirmed duplicate image pairs were detected in HAM10000, and they are visualized in Fig. 5.
  • Figure 5: Visualizing the 18 "Missed duplicates" (Fig. 3) in HAM10000 obtained through the analysis of the top 1,000 most similar image pairs (Fig. 4). According to the metadata, the images in each of these 18 pairs (image IDs along the horizontal axis) belong to different lesions (lesion IDs along the vertical axis), but manual review shows that both images in each pair belong to the same lesion, and the pairs are thus duplicates. Images from HAM10000 are licensed under CC BY-NC 4.0 (tschandl2018ham10000).
  • ...and 12 more figures
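
The four-way pair categorization described in the Figure 3 caption, combined with the similarity ranking used in Figure 4, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: plain NumPy cosine similarity stands in for fastdup, and the embeddings, lesion IDs, and the `reviewed_same_lesion` flag (standing in for the outcome of manual review) are hypothetical toy values.

```python
from itertools import combinations

import numpy as np


def rank_pairs_by_cosine_similarity(embeddings):
    """Return all image pairs (i, j, similarity), most similar first."""
    # Normalize rows so that the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    pairs = [
        (i, j, float(sims[i, j]))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


def categorize_pair(same_lesion_id, reviewed_same_lesion):
    """Assign a pair to one of the four scenarios from the Figure 3 caption.

    same_lesion_id: the metadata lists the same lesion ID for both images.
    reviewed_same_lesion: manual review judged both images to show one lesion
    (a hypothetical input here; in practice it comes from a human reviewer).
    """
    if same_lesion_id and reviewed_same_lesion:
        return "Confirmed duplicate"
    if not same_lesion_id and reviewed_same_lesion:
        return "Missed duplicate"
    if not same_lesion_id and not reviewed_same_lesion:
        return "True non-duplicate"
    return "False duplicate"


# Toy data (assumed for illustration only): four images with 2-D embeddings.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]])
lesion_ids = ["L1", "L1", "L2", "L3"]
# Pairs that a hypothetical reviewer judged to show the same lesion.
reviewed_same = {(0, 1), (0, 3), (1, 3)}

ranked = rank_pairs_by_cosine_similarity(embeddings)
categories = {
    (i, j): categorize_pair(lesion_ids[i] == lesion_ids[j], (i, j) in reviewed_same)
    for i, j, _ in ranked
}
```

In this toy setup, the pair of images 0 and 3 has different lesion IDs in the metadata but is judged to be the same lesion, so it falls into the "Missed duplicate" category, which is the case the similarity ranking is meant to surface for review.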