MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis
Sadia Kamal, Tim Oates
TL;DR
The paper addresses the need for trustworthy AI in skin lesion diagnosis by combining a vision-language framework (CLIP) with a novel entropy-weighted gradient explainability mechanism (MedGrad E-CLIP). It trains CLIP on dermoscopic images and diagnostic criteria, and introduces a weighted entropy approach to highlight regions that align with textual descriptions, improving interpretability without sacrificing accuracy. Experiments on PH2 and Derm7pt demonstrate competitive classification performance and enhanced, fine-grained explanations compared to Grad-CAM and Grad E-CLIP. This work advances clinically relevant, explainable AI in dermatology and sets the stage for more robust multimodal diagnostic tools in medical imaging.
Abstract
As deep learning models gain traction in medical data analysis, ensuring transparent and trustworthy decision-making is essential. In skin cancer diagnosis, advances in lesion detection and classification have improved accuracy, but the black-box nature of these methods makes their decision processes hard to understand, leading to trust issues among physicians. This study leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on different skin lesion datasets, to capture meaningful relationships between visual features and diagnostic criteria terms. To further enhance transparency, we propose a method called MedGrad E-CLIP, which builds on gradient-based E-CLIP by incorporating a weighted entropy mechanism designed for complex medical imaging such as skin lesions. This approach highlights critical image regions linked to specific diagnostic descriptions. The resulting integrated pipeline not only classifies skin lesions by matching them to corresponding descriptions but also adds an essential layer of explainability developed specifically for medical data. By visually explaining how different features in an image relate to diagnostic criteria, this approach demonstrates the potential of advanced vision-language models in medical image analysis, ultimately improving transparency, robustness, and trust in AI-driven diagnostic systems.
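
The description-matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows the general CLIP-style idea of scoring an image embedding against several text embeddings by cosine similarity and turning the scores into probabilities with a softmax. The function names (`cosine`, `match_descriptions`), the three-dimensional toy embeddings, the example descriptions, and the temperature value of 0.07 are all assumptions chosen for illustration; a real pipeline would obtain embeddings from trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_descriptions(image_emb, text_embs, temperature=0.07):
    """Return (index of best-matching description, softmax probabilities).

    temperature is an assumed scaling factor, not a value from the paper.
    """
    sims = [cosine(image_emb, t) for t in text_embs]
    logits = [s / temperature for s in sims]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical toy data: hand-written stand-ins for encoder outputs.
descriptions = [
    "atypical pigment network",
    "regular pigment network",
    "blue-whitish veil",
]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
image_emb = [0.8, 0.2, 0.1]

best, probs = match_descriptions(image_emb, text_embs)
print(descriptions[best])
```

In this toy setup the image embedding lies closest to the first text embedding, so the first description is selected. An explainability method such as the one proposed would then ask which image regions drive that particular image-text similarity score.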
