Are Explanations Helpful? A Comparative Analysis of Explainability Methods in Skin Lesion Classifiers

Rosa Y. G. Paccotacya-Yanque, Alceu Bissoto, Sandra Avila

TL;DR

The paper compares seven post-hoc explainability methods for deep skin-lesion classifiers: four pixel-attribution methods (Grad-CAM, Score-CAM, LIME, SHAP) and three concept-based methods (ACE, ICE, CME). All are applied to an Inception-v4 model trained on ISIC 2018 Task 3 data, which reaches a ROC AUC of $89.96\% \pm 0.52$. The paper formalizes three desiderata for explanations (fidelity, meaningfulness, and effectiveness) and evaluates how well each method meets them. Pixel-attribution methods reveal biases and spurious correlations but often do not sufficiently justify a prediction, while concept-based methods offer higher-level explanations whose interpretability and fidelity vary (e.g., ICE $11.83\%$ relative error; CME $0.88$ ROC AUC). The study suggests that no single explainability approach suffices and that a combined, clinician-informed strategy is more promising for trustworthy deployment. Future work includes physician-perception studies and broader datasets and architectures.
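The relative-error figure quoted for ICE can be read as a fidelity measure: how far the model's outputs move when its internal features are replaced by their concept-based reconstruction. The sketch below is one plausible way to compute such a number, not the paper's implementation. The function name `relative_error`, the use of an L1 ratio, and the toy values are all assumptions made for illustration.

```python
import numpy as np


def relative_error(original_outputs, reconstructed_outputs):
    """Relative L1 error between original and reconstructed model outputs.

    Hypothetical helper: assumes fidelity is measured as
    sum(|f(x) - f'(x)|) / sum(|f(x)|), where f(x) are the outputs of the
    original model and f'(x) the outputs obtained after the features are
    rebuilt from concepts. Lower values mean a more faithful explanation.
    """
    original = np.asarray(original_outputs, dtype=float)
    reconstructed = np.asarray(reconstructed_outputs, dtype=float)
    if original.shape != reconstructed.shape:
        raise ValueError("outputs must have the same shape")
    denominator = np.abs(original).sum()
    if denominator == 0:
        raise ValueError("original outputs sum to zero; relative error is undefined")
    return float(np.abs(original - reconstructed).sum() / denominator)


# Toy logits for two images (illustrative values only, not from the paper).
original_logits = [[2.0, -2.0], [1.0, -1.0]]
reconstructed_logits = [[1.5, -2.5], [1.0, -1.0]]
error = relative_error(original_logits, reconstructed_logits)
print(f"relative error: {error:.2%}")
```

Under this reading, a value of $11.83\%$ would mean the reconstructed outputs deviate from the original ones by roughly an eighth of their total magnitude.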

Abstract

Deep Learning has shown outstanding results in computer vision tasks; healthcare is no exception. However, there is no straightforward way to expose the decision-making process of DL models. Good accuracy is not enough for skin cancer predictions. Understanding the model's behavior is crucial for clinical application and reliable outcomes. In this work, we identify desiderata for explanations in skin-lesion models. We analyzed seven methods, four based on pixel-attribution (Grad-CAM, Score-CAM, LIME, SHAP) and three on high-level concepts (ACE, ICE, CME), for a deep neural network trained on the International Skin Imaging Collaboration Archive. Our findings indicate that while these techniques reveal biases, there is room for improving the comprehensiveness of explanations to achieve transparency in skin-lesion models.

Paper Structure

This paper contains 9 sections, 3 figures.

Figures (3)

  • Figure 1: Pixel attribution results: Yellow indicates relevance for the prediction. In LIME, green highlights positive contributions. In SHAP, green pixels contribute positively, while red pixels contribute negatively.
  • Figure 2: Concept-based explanation results.
  • Figure 3: Saliency results for predictions with high confidence to test fidelity. First row: Melanoma class, second row: Benign class.