Clinical Interpretability of Deep Learning Segmentation Through Shapley-Derived Agreement and Uncertainty Metrics

Tianyi Ren, Daniel Low, Pittra Jaengprajak, Juampablo Heras Rivera, Jacob Ruzevick, Mehmet Kurt

TL;DR

The paper addresses the opacity of deep learning-based brain tumor segmentation by introducing contrast-level Shapley values to explain multi-contrast MRI predictions. It defines two clinically meaningful metrics (agreement with clinician imaging rankings, and cross-fold uncertainty) that translate explanations into reliability cues, and evaluates them on the BraTS 2024 GoAT dataset across four architectures. Better Dice performance coincides with greater agreement between Shapley-based rankings and clinical protocols, while higher ranking variance (uncertainty) is associated with worse performance, providing a pathway to clinically interpretable reliability assessment. The work offers a practical framework for bridging black-box segmentation models and clinical decision-making, with the aim of enhancing trust and adoption in clinical workflows.

Abstract

Segmentation is the identification of anatomical regions of interest, such as organs, tissue, and lesions, and is a fundamental task in computer-aided diagnosis in medical imaging. Deep learning models have achieved remarkable performance in medical image segmentation, yet despite growing research attention, explainability remains critical for their acceptance and integration in clinical practice. Our approach explored the use of contrast-level Shapley values, a systematic perturbation of model inputs to assess feature importance. While other studies have investigated gradient-based techniques that identify influential regions in imaging inputs, Shapley values offer a broader, clinically aligned approach, explaining how model performance is fairly attributed to certain imaging contrasts over others. Using the BraTS 2024 dataset, we generated rankings of Shapley values for four MRI contrasts across four model architectures. Two metrics were derived from the Shapley ranking: agreement between the model and "clinician" imaging rankings, and uncertainty quantified through Shapley ranking variance across cross-validation folds. Higher-performing cases (Dice > 0.6) showed significantly greater agreement with clinical rankings. Increased Shapley ranking variance correlated with decreased performance (U-Net: $r=-0.581$). These metrics provide clinically interpretable proxies for model reliability, helping clinicians better understand state-of-the-art segmentation models.
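To make the idea of contrast-level Shapley values concrete, the following is a minimal sketch rather than the authors' implementation. It computes exact Shapley values for four "players" (MRI contrasts) by enumerating all subsets. The contrast names, the function names, and the toy value function `toy_dice` (a stand-in for the Dice score a model would achieve when only a subset of contrasts is available) are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

# Assumed contrast names; the abstract only states that four MRI contrasts are used.
CONTRASTS = ("T1", "T1c", "T2", "FLAIR")


def shapley_values(players, value):
    """Exact Shapley values for a small set of players.

    `value` maps a frozenset of players to a scalar score,
    for example a Dice score with only those contrasts present.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when added to coalition s
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_dice(subset):
    """Hypothetical additive score; a real study would run the segmentation model."""
    gains = {"T1": 0.05, "T1c": 0.40, "T2": 0.10, "FLAIR": 0.25}
    return sum(gains[c] for c in subset)


phi = shapley_values(CONTRASTS, toy_dice)
# Rank contrasts from most to least important
ranking = sorted(CONTRASTS, key=lambda c: phi[c], reverse=True)
```

With four contrasts, exact enumeration needs only 16 subset evaluations, so no sampling approximation is required in this sketch. How a contrast is "removed" from a real model input (zeroing, noise, or another perturbation) is not specified in the abstract and is left open here.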

Paper Structure

This paper contains 7 sections, 4 equations, 3 figures.

Figures (3)

  • Figure 1: Overview of explainability metrics derived from model-generated Shapley values.
  • Figure 2: Comparison of Normalized Spearman Footrule (agreement) between increasing Dice score groups. Colors indicate agreement between the U-Net Shapley ranking and annotator consensus (blue) and clinical standard (green).
  • Figure 3: Dice score performance correlated against Shapley rank variance ($V$, see Section 2.4) for four models. Subcohort Spearman regression values are listed for cases above (red) and below (blue) the threshold of 0.275 Shapley rank variance.
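
The two quantities named in the figure captions, the Normalized Spearman Footrule and the Shapley rank variance $V$, can be sketched as follows. This is an illustrative sketch under stated assumptions: the footrule is normalized by its maximum value $\lfloor n^2/2 \rfloor$ (one common convention), and the rank variance is taken as the population variance of each contrast's rank across folds, averaged over contrasts. The paper's own definitions (its Section 2.4) may differ, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def normalized_footrule(rank_a, rank_b):
    """Spearman footrule distance between two rankings, scaled to [0, 1].

    Each argument lists the rank assigned to every item, in a fixed item order.
    0 means identical rankings; 1 means maximally different (assumed normalization).
    """
    n = len(rank_a)
    dist = sum(abs(a - b) for a, b in zip(rank_a, rank_b))
    return dist / (n * n // 2)


def rank_variance(fold_ranks):
    """Mean over contrasts of the variance of each contrast's rank across folds.

    `fold_ranks` holds one ranking per cross-validation fold
    (assumed aggregation, not necessarily the paper's V).
    """
    per_contrast = zip(*fold_ranks)  # transpose: one tuple of ranks per contrast
    return mean(pvariance(col) for col in per_contrast)


# Hypothetical rankings of four contrasts
model_rank = [1, 2, 3, 4]
clinician_rank = [2, 1, 3, 4]
distance = normalized_footrule(model_rank, clinician_rank)
variance = rank_variance([model_rank, clinician_rank])
```

Under this convention a lower footrule distance means closer agreement with the clinician ranking; an agreement score could be reported as one minus the distance, although which direction the paper plots is not stated in the captions.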