
When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images

Loris Cino, Pier Luigi Mazzeo, Alessandro Martella, Giulia Radi, Renato Rossi, Cosimo Distante

Abstract

The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a phenomenon statistically shown to exceed random chance. To determine whether these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted: Cohen's kappa dropped to a mere 0.08 on the difficult images, compared to 0.61 on the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss' kappa = 0.456) on control images to only modest agreement (Fleiss' kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.
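The two agreement statistics reported above can be computed directly from rater labels. As an illustrative sketch only (not the paper's evaluation code), the following computes Cohen's kappa from scratch for two label sequences; the diagnosis labels shown are hypothetical examples using the dataset's class abbreviations:

```python
from collections import Counter

def cohen_kappa(y1, y2):
    """Chance-corrected agreement between two raters
    (or one rater versus ground-truth labels)."""
    assert len(y1) == len(y2) and len(y1) > 0
    n = len(y1)
    # Observed agreement: fraction of identical labels.
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    c1, c2 = Counter(y1), Counter(y2)
    p_e = sum(c1[lab] * c2[lab] for lab in set(y1) | set(y2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rater-vs-ground-truth labels over four lesions.
truth = ["MEL", "NV", "NV", "BCC"]
rater = ["MEL", "NV", "BCC", "BCC"]
print(cohen_kappa(truth, rater))
```

In practice, `sklearn.metrics.cohen_kappa_score` and `statsmodels.stats.inter_rater.fleiss_kappa` provide tested implementations of both statistics.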
Paper Structure

This paper contains 20 sections, 1 equation, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Representative dermatoscopic images for each diagnostic category included in the ISIC 2019 dataset. The first row shows examples correctly classified by all neural network models, while the second row presents images misclassified by the models. Each column corresponds to one diagnostic class (AK, BCC, BKL, DF, MEL, NV, SCC, VASC).
  • Figure 2: Distribution of the number of images jointly misclassified by all models, obtained through a stratified permutation test used to assess whether the observed level of simultaneous mistakes exceeds what would be expected under the null hypothesis.
  • Figure 3: Examples of low-quality images excluded from the analysis. The first two images from the left were considered to be of insufficient quality by expert dermatologists. The remaining two display blurred cases automatically identified by the blur score.
  • Figure 4: Effect of varying the blur-score threshold on the proportion of images classified as blurry. The curves compare the cumulative percentage of all images and dermatologist-annotated blurred images as the threshold increases.
  • Figure 5: Images exhibiting strong high-frequency components, identified via Fourier transform. These images are heavily occluded by hair artifacts, and some instances are exact duplicates.
  • ...and 6 more figures
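The stratified permutation test behind Figure 2 can be sketched as follows. This is a simplified reconstruction under stated assumptions, not the authors' code: each model's error mask is shuffled independently within each diagnostic class (the strata), and the number of images misclassified by all models is recorded per permutation, yielding a null distribution against which the observed joint-error count can be compared.

```python
import numpy as np

def null_joint_errors(errors, labels, n_perm=2000, rng=None):
    """Null distribution of the number of images misclassified by ALL models.

    errors : boolean array (n_models, n_images), True = misclassified
    labels : integer array (n_images,) of diagnostic classes (the strata)
    """
    rng = rng or np.random.default_rng(0)
    counts = np.empty(n_perm, dtype=int)
    perm = np.empty_like(errors)
    for p in range(n_perm):
        # Shuffle each model's errors independently, within each class stratum,
        # so per-model, per-class error rates are preserved under the null.
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            for m in range(errors.shape[0]):
                perm[m, idx] = errors[m, rng.permutation(idx)]
        counts[p] = perm.all(axis=0).sum()  # images every model gets wrong
    return counts

# Toy example: 2 models, 6 images in 2 classes.
labels = np.array([0, 0, 0, 1, 1, 1])
errors = np.array([[True, True, False, True, False, False],
                   [True, False, True, False, True, False]])
counts = null_joint_errors(errors, labels, n_perm=500)
observed = errors.all(axis=0).sum()
p_value = (counts >= observed).mean()  # upper-tail permutation p-value
```

The p-value is the fraction of permutations producing at least as many joint errors as observed; a small value indicates the models' shared mistakes exceed what chance alignment of independent errors would produce.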
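The "blur score" used to flag the images in Figures 3 and 4 is not specified here; a common choice, shown below purely as an assumed stand-in, is the variance of the image Laplacian, which is low when an image lacks sharp edges:

```python
import numpy as np

def blur_score(gray):
    """Variance-of-Laplacian sharpness proxy for a 2-D grayscale array.

    Low values suggest a blurry image (few high-frequency edges).
    Uses a wrap-around 4-neighbour Laplacian for simplicity.
    """
    gray = np.asarray(gray, dtype=float)
    lap = (np.roll(gray, 1, axis=0) + np.roll(gray, -1, axis=0)
           + np.roll(gray, 1, axis=1) + np.roll(gray, -1, axis=1)
           - 4.0 * gray)
    return lap.var()

# A checkerboard (sharp edges everywhere) scores far higher than a flat image.
sharp = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
flat = np.zeros((8, 8))
print(blur_score(sharp), blur_score(flat))
```

Thresholding such a score, as in Figure 4, trades off the fraction of all images flagged against coverage of the dermatologist-annotated blurred cases; in production pipelines `cv2.Laplacian(img, cv2.CV_64F).var()` is the usual implementation.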