XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models

Thuraya Alzubaidi, Sana Ammar, Maryam Alsharqi, Islem Rekik, Muzammil Behzad

TL;DR

XAI-CLIP combines region-aware localization from vision-language representations with ROI-restricted perturbation methods to produce anatomically meaningful explanations for medical image segmentation. By constraining perturbations to clinically relevant regions and using a segmentation-guided localization pipeline (including CoOp prompts and a U-Net backbone), it yields clearer attribution maps at substantially lower runtime. Empirical results on FLARE22 and CHAOS show improved Dice and IoU for occlusion-based explanations and strong efficiency gains across LIME and RISE variants, supporting both interpretability and practicality in clinical settings. The framework pairs semantic region priors with efficient perturbation-based XAI, working toward transparent and deployable segmentation systems.
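For readers less familiar with the two overlap metrics named above, the following is a minimal sketch of how Dice and IoU are computed between two binary masks. It assumes the explanation map has already been binarized (for example by thresholding) before comparison with a reference mask; the function name `dice_and_iou` and the toy arrays are illustrative and do not come from the paper.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-8):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()   # |A ∩ B|
    union = np.logical_or(pred, target).sum()    # |A ∪ B|
    total = pred.sum() + target.sum()            # |A| + |B|
    dice = (2.0 * inter + eps) / (total + eps)   # 2|A ∩ B| / (|A| + |B|)
    iou = (inter + eps) / (union + eps)          # |A ∩ B| / |A ∪ B|
    return float(dice), float(iou)

# Toy masks (hypothetical): 3 foreground pixels each, 2 of them shared.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.3f}  IoU={iou:.3f}")
```

Dice weights the overlap twice relative to the mask sizes, so it is never lower than IoU on the same pair of masks; this is why relative improvements in the two metrics can differ considerably.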

Abstract

Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union (IoU) for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts. Together, these findings indicate that incorporating multimodal vision-language representations into perturbation-based XAI frameworks significantly enhances both interpretability and efficiency, thereby enabling transparent and clinically deployable medical image segmentation systems.
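To make the idea of region-restricted perturbation concrete, here is a minimal sketch of occlusion sensitivity in which only patches overlapping a given ROI mask are perturbed. It is an illustration under stated assumptions, not the authors' implementation: `roi_occlusion_map`, `score_fn`, the patch size, and the toy image are all hypothetical, and `score_fn` stands in for whatever scalar score is read from the segmentation model (for example, mean foreground probability of the target organ).

```python
import numpy as np

def roi_occlusion_map(image, roi_mask, score_fn, patch=8, stride=8, fill=0.0):
    """Occlusion sensitivity limited to patches that overlap roi_mask.

    Returns the heatmap and the number of perturbed forward passes used.
    """
    h, w = image.shape[:2]
    base = score_fn(image)                      # score on the unperturbed input
    heat = np.zeros((h, w), dtype=np.float64)
    counts = np.zeros((h, w), dtype=np.float64)
    n_calls = 0
    for y in range(0, h, stride):
        for x in range(0, w, stride):
            ys = slice(y, min(y + patch, h))
            xs = slice(x, min(x + patch, w))
            if not roi_mask[ys, xs].any():
                continue                        # skip patches outside the ROI
            occluded = image.copy()
            occluded[ys, xs] = fill
            drop = base - score_fn(occluded)    # importance = score drop
            n_calls += 1
            heat[ys, xs] += drop
            counts[ys, xs] += 1
    # Average overlapping contributions; untouched pixels stay at zero.
    heat = np.divide(heat, counts, out=np.zeros_like(heat), where=counts > 0)
    return heat, n_calls

# Toy setup (hypothetical): a bright 8x8 "organ" in a 32x32 image.
image = np.zeros((32, 32))
image[8:16, 8:16] = 1.0
score_fn = lambda img: float(img[8:16, 8:16].sum())  # stand-in for a model score

roi = np.zeros((32, 32), dtype=bool)
roi[8:24, 8:24] = True                                # assumed localized ROI
heat_roi, calls_roi = roi_occlusion_map(image, roi, score_fn)
heat_full, calls_full = roi_occlusion_map(image, np.ones((32, 32), dtype=bool), score_fn)
print(calls_roi, calls_full)
```

In this toy case the ROI-restricted run needs a quarter of the forward passes of the full-image run while assigning the same importance to the organ region. Actual savings depend on ROI size and on the cost of localization, so this ratio should not be read as a reproduction of the runtime figures reported above.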

Paper Structure (28 sections, 11 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Overview of the XAI-CLIP framework. Text and image inputs are encoded using vision and text encoders. A joint embedding space guides region-restricted perturbation-based XAI, producing anatomically aligned explainability maps.
  • Figure 2: Representative MRI slices from the FLARE22 dataset [ma2023unleashing], showing images before preprocessing (top row) and after preprocessing (bottom row) across three samples.
  • Figure 3: LIME explanations using different superpixel methods: QuickShift, Felzenszwalb, and SLIC. Top: superpixels; bottom: our corresponding heatmaps showing influential regions.
  • Figure 4: RISE explanation: input with segmentation (left) and our corresponding importance map (right) showing the contribution of regions to the prediction.
  • Figure 5: Occlusion explanation: original input (left), segmentation mask (middle), and our importance heatmap (right) showing regions critical to the prediction.
  • ...and 8 more figures
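The RISE-style importance map described in the Figure 4 caption can be sketched as follows. This is a simplified illustration, assuming nearest-neighbour upsampling of a coarse random grid (the original RISE formulation additionally uses random shifts and bilinear smoothing) and assuming that ROI restriction means pixels outside the ROI are never occluded. The names `roi_rise_map` and `score_fn` and the toy data are hypothetical.

```python
import numpy as np

def roi_rise_map(image, roi_mask, score_fn, n_masks=200, cells=4, p_keep=0.5, seed=0):
    """RISE-style saliency where random masking is applied only inside roi_mask."""
    rng = np.random.default_rng(seed)
    roi_mask = np.asarray(roi_mask, dtype=bool)
    h, w = image.shape
    ch, cw = -(-h // cells), -(-w // cells)          # ceil(h / cells), ceil(w / cells)
    saliency = np.zeros((h, w), dtype=np.float64)
    for _ in range(n_masks):
        grid = (rng.random((cells, cells)) < p_keep).astype(np.float64)
        mask = np.kron(grid, np.ones((ch, cw)))[:h, :w]   # upsample coarse grid
        mask = np.where(roi_mask, mask, 1.0)              # keep non-ROI pixels intact
        saliency += score_fn(image * mask) * mask         # score-weighted mask sum
    saliency /= n_masks * p_keep
    saliency[~roi_mask] = 0.0                             # report importance inside the ROI only
    return saliency

# Toy setup (hypothetical): the score depends only on a 4x4 block.
image = np.zeros((16, 16))
image[4:8, 4:8] = 1.0
score_fn = lambda img: float(img[4:8, 4:8].sum())     # stand-in for a model score
roi = np.zeros((16, 16), dtype=bool)
roi[0:8, 0:8] = True                                  # assumed localized ROI
sal = roi_rise_map(image, roi, score_fn)
print(sal[5, 5] > sal[1, 1])
```

Pixels whose visibility drives the score accumulate more weight than pixels that are merely co-visible by chance, which is the mechanism behind the importance map; restricting the masks to the ROI concentrates the random sampling budget on the region of interest.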