Generalizable and Explainable Deep Learning for Medical Image Computing: An Overview
Ahmad Chaddad, Yan Hu, Yihang Wu, Binbin Wen, Reem Kateb
TL;DR
This paper surveys the role of generalizability and explainability in deep learning for medical image analysis, arguing that clinical deployment requires transparent and robust models. It implements four CNN backbones across three public datasets and evaluates five local XAI methods using the ROAD metric, supplemented by paired t-tests and timing analysis to assess both accuracy and explainability efficiency. The findings indicate that XGradCAM and AblationCAM often provide clearer localization of pathological regions and higher confidence gains in several tasks, while methods like EigenGradCAM may underperform in complex skin-cancer cases; LayerCAM and XGradCAM offer favorable trade-offs between speed and interpretability. The work highlights practical implications for clinical adoption and suggests avenues such as hybrid XAI techniques and more robust, diverse benchmarking to advance reliable, generalizable medical imaging solutions.
Abstract
Objective. This paper presents an overview of generalizable and explainable artificial intelligence (XAI) in deep learning (DL) for medical imaging, addressing the urgent need for transparency and explainability in clinical applications. Methodology. We apply four CNNs to three medical datasets (brain tumor, skin cancer, and chest X-ray) for medical image classification. We perform paired t-tests to assess whether the differences observed between methods are statistically significant. We then combine ResNet50 with five common XAI techniques to obtain explanations of model predictions, with the aim of improving model transparency, and use a quantitative metric (confidence increase) to evaluate the usefulness of these techniques. Key findings. The experimental results indicate that ResNet50 achieves reasonable accuracy and F1 score on all datasets (e.g., 86.31% accuracy in skin cancer). The findings also show that while certain XAI methods, such as XGradCAM, effectively highlight relevant abnormal regions in medical images, others, like EigenGradCAM, may perform less effectively in specific scenarios. In addition, XGradCAM yields a higher confidence increase (e.g., 0.12 in glioma tumor) than GradCAM++ (0.09) and LayerCAM (0.08). Implications. Based on the experimental results and recent advancements, we outline future research directions to enhance the robustness and generalizability of DL models in biomedical imaging.
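The two quantitative tools named in the abstract, the confidence-increase metric and the paired t-test, can be illustrated with a minimal sketch. The abstract does not define either computation precisely, so the code rests on two assumptions: that confidence increase is the change in the target-class probability when the model sees the saliency-masked image instead of the original, and that the t-test is run on per-sample paired scores. All function names and numeric values below are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_increase(logits_original, logits_masked, target):
    """Assumed definition: target-class probability on the saliency-masked
    input minus the target-class probability on the original input."""
    return softmax(logits_masked)[target] - softmax(logits_original)[target]


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical logits for one image, before and after masking.
    print(confidence_increase([1.0, 0.5, 0.2], [1.6, 0.4, 0.1], target=0))
    # Hypothetical per-image confidence increases for two XAI methods.
    print(paired_t_statistic([0.12, 0.10, 0.15, 0.11], [0.09, 0.08, 0.10, 0.09]))
```

The sketch returns only the t statistic and its degrees of freedom; obtaining a p-value additionally requires the Student t distribution, which a statistics library such as SciPy provides.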
