MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment
Yequan Bie, Luyang Luo, Hao Chen
TL;DR
MICA addresses the need for explainable skin lesion diagnosis by integrating multi-level image-concept alignment with a concept bottleneck. It combines a CNN image encoder and an LLM-based concept encoder to align global, regional, and subspace concept representations, enabling both visual and textual explanations. Across three skin datasets, the method reports improved disease-diagnosis performance and robust concept detection while remaining interpretable, and it supports test-time concept interventions for faithful explanations. As an ante-hoc XAI approach for dermatology, it has the potential to improve clinician trust and diagnostic efficiency through human-interpretable reasoning and adjustable, concept-driven decisions.
Abstract
Black-box deep learning approaches have showcased significant potential in the realm of medical image analysis. However, the stringent trustworthiness requirements intrinsic to the medical field have catalyzed research into the utilization of Explainable Artificial Intelligence (XAI), with a particular focus on concept-based methods. Existing concept-based methods predominantly apply concept annotations from a single perspective (e.g., global level), neglecting the nuanced semantic relationships between sub-regions and concepts embedded within medical images. This leads to underutilization of the valuable medical information and may cause models to fall short in harmoniously balancing interpretability and performance when employing inherently interpretable architectures such as Concept Bottlenecks. To mitigate these shortcomings, we propose a multi-modal explainable disease diagnosis framework that meticulously aligns medical images and clinical-related concepts semantically at multiple strata, encompassing the image level, token level, and concept level. Moreover, our method allows for model intervention and offers both textual and visual explanations in terms of human-interpretable concepts. Experimental results on three skin image datasets demonstrate that our method, while preserving model interpretability, attains high performance and label efficiency for concept detection and disease diagnosis.
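To make the idea of a concept bottleneck with model intervention concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that image, region, and concept embeddings already live in a shared space, scores each concept by averaging a global (image-level) and a regional (best-matching region) cosine similarity, and predicts the diagnosis from the concept scores alone. All function names, the concept and class names, the embeddings, the weights, and the averaging and scaling choices are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concept_scores(image_vec, region_vecs, concept_vecs, scale=5.0):
    """Score each concept from a global and a regional view.

    Assumption: the two views are fused by a plain average and squashed
    with a sigmoid; the paper's actual alignment objectives may differ.
    """
    scores = []
    for c in concept_vecs:
        global_sim = cosine(image_vec, c)                       # whole image vs. concept
        regional_sim = max(cosine(r, c) for r in region_vecs)   # best-matching region
        scores.append(sigmoid(scale * 0.5 * (global_sim + regional_sim)))
    return scores


def diagnose(concepts, weights, bias):
    """Linear classifier that sees only concept scores (the bottleneck)."""
    logits = [
        sum(w * c for w, c in zip(row, concepts)) + b
        for row, b in zip(weights, bias)
    ]
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return predicted, logits


def intervene(concepts, corrections):
    """Return a copy of the concept scores with expert-provided overrides."""
    fixed = list(concepts)
    for index, value in corrections.items():
        fixed[index] = value
    return fixed


# Hypothetical concepts and classes, with made-up 2-D embeddings.
concept_names = ["atypical pigment network", "regular streaks"]
class_names = ["nevus", "melanoma"]
concept_vecs = [[1.0, 0.0], [0.0, 1.0]]
image_vec = [0.2, 0.9]
region_vecs = [[0.1, 1.0], [0.3, 0.8]]

# Hypothetical classifier weights: one row per class, one column per concept.
weights = [[-2.0, 1.0], [2.0, -1.0]]
bias = [0.0, 0.0]

scores = concept_scores(image_vec, region_vecs, concept_vecs)
pred_before, _ = diagnose(scores, weights, bias)

# A clinician judges that concept 0 is absent and overrides its score.
corrected = intervene(scores, {0: 0.0})
pred_after, _ = diagnose(corrected, weights, bias)

print("before intervention:", class_names[pred_before])
print("after intervention: ", class_names[pred_after])
```

Because the classifier sees only the concept scores, each prediction can be explained by the concepts that drove it, and correcting a single concept score at test time changes the diagnosis without retraining.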
