XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models
Thuraya Alzubaidi, Sana Ammar, Maryam Alsharqi, Islem Rekik, Muzammil Behzad
TL;DR
XAI-CLIP combines region-aware localization from vision-language representations with ROI-restricted perturbation methods to produce anatomically meaningful explanations for medical image segmentation. By constraining perturbations to clinically relevant regions and employing a segmentation-guided localization pipeline (including CoOp prompts and a U-Net backbone), it yields clearer attribution maps at substantially lower runtime. Empirical results on FLARE22 and CHAOS show improved Dice and IoU scores for occlusion-based explanations and strong efficiency gains across LIME and RISE variants, supporting both interpretability and practicality in clinical settings. The framework advances explainable medical imaging by pairing semantic region priors with efficient perturbation-based XAI, facilitating transparent and deployable segmentation systems.
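For readers less familiar with the two overlap metrics named above, the following is a minimal sketch of how Dice and IoU can be computed between a binarized attribution map and a reference mask. The 0.5 threshold, the toy arrays, and the helper name `dice_and_iou` are illustrative assumptions; the evaluation protocol actually used for the reported numbers is not described in this summary.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention (an assumption here): two empty masks count as a perfect match.
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

# Toy reference mask: a 4x4 "organ" (16 pixels).
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True

# Toy attribution map that spills two columns past the organ (24 pixels).
saliency = np.zeros((8, 8))
saliency[2:6, 2:8] = 0.9

# Binarize at an assumed threshold of 0.5 before scoring.
pred = saliency >= 0.5
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")  # Dice = 0.800, IoU = 0.667
```

For any pair of masks the two scores are related by IoU = Dice / (2 - Dice), so IoU is never larger than Dice and a given improvement in overlap shows up as a larger relative change in IoU.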
Abstract
Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union (IoU) for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts. Together, these findings indicate that incorporating multimodal vision-language representations into perturbation-based XAI frameworks enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
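The idea of restricting perturbations to a region of interest can be illustrated with a minimal occlusion-sensitivity sketch. This is not the XAI-CLIP implementation: the ROI mask is drawn by hand rather than localized from vision-language embeddings, `toy_score` is a hypothetical stand-in for a segmentation model's confidence, and the patch size and fill value are arbitrary assumptions. It only shows why skipping patches outside the ROI reduces the number of forward passes and leaves no attribution outside the region.

```python
import numpy as np

def occlusion_map(image, score_fn, roi_mask=None, patch=8, fill=0.0):
    """Patch-wise occlusion sensitivity on a 2D image.

    If roi_mask is given, only patches that overlap the ROI are perturbed.
    Returns (attribution map, number of score_fn calls).
    """
    h, w = image.shape
    base = score_fn(image)          # score on the unperturbed image
    attribution = np.zeros((h, w), dtype=float)
    passes = 1
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if roi_mask is not None and not roi_mask[y:y + patch, x:x + patch].any():
                continue            # patch lies entirely outside the ROI: skip it
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            # Attribution = drop in score caused by hiding this patch.
            attribution[y:y + patch, x:x + patch] = base - score_fn(occluded)
            passes += 1
    return attribution, passes

# Toy data: random image and a hand-drawn rectangular "organ" ROI (assumption).
rng = np.random.default_rng(0)
image = rng.random((64, 64))
organ = np.zeros((64, 64), dtype=bool)
organ[16:32, 24:48] = True

def toy_score(img):
    # Hypothetical model score: mean intensity inside the organ region.
    return float(img[organ].mean())

full_map, full_passes = occlusion_map(image, toy_score)
roi_map, roi_passes = occlusion_map(image, toy_score, roi_mask=organ)
print(full_passes, roi_passes)  # 65 7
```

In this toy setting the ROI-restricted run evaluates 6 patches instead of 64 and produces the same attribution map, because the toy score depends only on pixels inside the ROI. With a real model the two maps would generally differ, and the savings depend on how much of the image the localized region covers.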
