MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis

Sadia Kamal, Tim Oates

TL;DR

The paper addresses the need for trustworthy AI in skin lesion diagnosis by combining a vision-language framework (CLIP) with a novel entropy-weighted gradient explainability mechanism (MedGrad E-CLIP). It trains CLIP on dermoscopic images and diagnostic criteria, and introduces a weighted entropy approach to highlight regions that align with textual descriptions, improving interpretability without sacrificing accuracy. Experiments on PH2 and Derm7pt demonstrate competitive classification performance and enhanced, fine-grained explanations compared to Grad-CAM and Grad E-CLIP. This work advances clinically relevant, explainable AI in dermatology and sets the stage for more robust multimodal diagnostic tools in medical imaging.

Abstract

As deep learning models gain traction in medical data analysis, ensuring transparent and trustworthy decision-making is essential. In skin cancer diagnosis, while advancements in lesion detection and classification have improved accuracy, the black-box nature of these methods poses challenges in understanding their decision processes, leading to trust issues among physicians. This study leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on different skin lesion datasets, to capture meaningful relationships between visual features and diagnostic criteria terms. To further enhance transparency, we propose a method called MedGrad E-CLIP, which builds on gradient-based E-CLIP by incorporating a weighted entropy mechanism designed for complex medical imaging like skin lesions. This approach highlights critical image regions linked to specific diagnostic descriptions. The developed integrated pipeline not only classifies skin lesions by matching corresponding descriptions but also adds an essential layer of explainability developed especially for medical data. By visually explaining how different features in an image relate to diagnostic criteria, this approach demonstrates the potential of advanced vision-language models in medical image analysis, ultimately improving transparency, robustness, and trust in AI-driven diagnostic systems.

Paper Structure

This paper contains 15 sections, 4 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: CLIP Overview for Custom Dataset: We encode skin lesion images and their descriptions to generate image and text embeddings. These are combined in a cross-modal interaction module, calculating cosine similarities to assess alignment between lesions and diagnoses, ensuring accurate classification.
  • Figure 2: Proposed pipeline
  • Figure 3: Test accuracy and loss on pre-trained model
  • Figure 4: Test accuracy and loss on fine-tuned model
  • Figure 5: Comparative visualization of explainability methods on an atypical nevus: original image, pre-trained Grad E-CLIP, trained Grad E-CLIP, MedGrad E-CLIP, and Grad-CAM
  • ...and 1 more figure
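
The Figure 1 caption describes classification as a cosine-similarity match between an image embedding and the embeddings of candidate descriptions. The sketch below illustrates only that matching step in plain Python. It is not the authors' implementation: the three-dimensional vectors are hand-made toy values standing in for real CLIP embeddings, the label names are illustrative, and the helper names (`cosine_similarity`, `softmax`, `classify`) and the temperature of 0.07 are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Turn similarity scores into probabilities (numerically stable)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_embedding, text_embeddings, temperature=0.07):
    """Pick the description whose embedding is most similar to the image.

    text_embeddings maps a label to its (hypothetical) text embedding.
    The temperature value is an assumption chosen for this illustration.
    """
    labels = list(text_embeddings)
    sims = [cosine_similarity(image_embedding, text_embeddings[l]) for l in labels]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy embeddings: placeholders, not outputs of a real encoder.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "common nevus": [0.1, 0.9, 0.1],
    "atypical nevus": [1.0, 0.0, 0.3],
    "melanoma": [0.0, 0.2, 1.0],
}

label, probs = classify(image_embedding, text_embeddings)
print(label)
```

In a real pipeline the vectors would come from the image and text encoders, and the same similarity score is the quantity that a gradient-based explanation method would differentiate with respect to the image features.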