
CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity

Kyungjin Kim, Youngro Lee, Jongmo Seo

TL;DR

CLE-SH tackles the lack of statistical validation in SHAP analyses for biomedical tabular data by delivering an automated, statistically driven pipeline that unifies feature selection, feature typing, univariate analysis, and interaction analysis with printable reports. The method combines traditional significance testing with regression-based function fitting (linear, quadratic, sigmoid) and interaction-aware statistics to produce interpretable, sentence-based summaries, enhancing rigor and accessibility in early-stage analyses. Key contributions include a lightweight, end-to-end library, explicit hyperparameters for significance control, and ready-to-report outputs that facilitate collaboration between clinicians and data scientists. The work demonstrates practical impact by enabling more reliable SHAP interpretations, potentially improving biomarker discovery and result validation in biomedical research.
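As a rough illustration of the feature-typing step mentioned above, the sketch below assigns each feature column to one of the three types the pipeline distinguishes (binary, discrete, continuous). The rule used here, counting distinct values against a cutoff, is an assumption made for illustration, as are the names `infer_feature_type` and `max_discrete_levels`; the summary does not state the package's actual criterion.

```python
def infer_feature_type(values, max_discrete_levels=10):
    """Classify one feature column as 'binary', 'discrete', or 'continuous'.

    Hypothetical rule (not taken from the paper): the type is decided purely
    by the number of distinct observed values.
    """
    levels = set(values)
    if len(levels) < 2:
        # A constant or empty column carries no pattern to analyze.
        raise ValueError("feature needs at least two distinct values")
    if len(levels) == 2:
        return "binary"
    if len(levels) <= max_discrete_levels:
        return "discrete"
    return "continuous"


# Small usage example on made-up columns.
columns = {
    "smoker": [0, 1, 1, 0, 1],
    "tumor_stage": [1, 2, 3, 2, 1, 4],
    "age": [31.5 + 0.7 * i for i in range(40)],
}
for name, column in columns.items():
    print(name, "->", infer_feature_type(column))
```

Keeping the cutoff as an explicit argument mirrors the summary's emphasis on exposed hyperparameters: the analyst, not the code, decides where "discrete" ends and "continuous" begins.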

Abstract

SHapley Additive exPlanations (SHAP) has recently been widely adopted across research domains. This is particularly evident in applied fields, where SHAP analysis serves as a crucial tool for identifying biomarkers and assisting in result validation. However, despite its frequent use, SHAP is often not applied in a manner that maximizes its potential contributions. A review of recent papers employing SHAP reveals that many studies subjectively select a limited number of features as 'important' and analyze SHAP values by visually inspecting plots without assessing statistical significance. Such superficial application may hinder meaningful contributions to the applied fields. To address this, we propose a library package designed to simplify the interpretation of SHAP values. Given only the original data and the SHAP values, our library provides: 1) the number of important features to analyze, 2) the pattern of each feature via univariate analysis, and 3) the interactions between features. All information is extracted based on its statistical significance and presented in simple, comprehensible sentences, enabling users of all levels to understand the interpretations. We hope this library fosters a comprehensive understanding of statistically valid SHAP results.
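To make the idea of a statistically backed, sentence-based interpretation concrete, here is a minimal sketch for a single binary feature. It is not the package's implementation: the abstract does not name the test used, so a two-sided permutation test on the difference in mean SHAP value is swapped in, and the function names, the `alpha` threshold, and the sentence wording are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def describe_binary_feature(name, values, shap_values, alpha=0.05):
    """Return a one-sentence summary for a feature coded as 0/1."""
    shap_0 = [s for v, s in zip(values, shap_values) if v == 0]
    shap_1 = [s for v, s in zip(values, shap_values) if v == 1]
    p = permutation_p_value(shap_0, shap_1)
    if p >= alpha:
        return (f"SHAP values of '{name}' do not differ significantly "
                f"between 0 and 1 (p = {p:.3f}).")
    direction = "higher" if mean(shap_1) > mean(shap_0) else "lower"
    return (f"SHAP values of '{name}' are significantly {direction} "
            f"when the feature is 1 than when it is 0 (p = {p:.3f}).")


# Made-up data: the feature clearly shifts the SHAP values upward.
values = [0] * 10 + [1] * 10
shap_values = [-0.5 + 0.01 * i for i in range(10)] + [0.5 + 0.01 * i for i in range(10)]
print(describe_binary_feature("smoker", values, shap_values))
```

The point of the sketch is the shape of the output: the statement about the feature is only phrased as a directional effect when the test passes the chosen significance threshold, otherwise the sentence reports the absence of a significant difference.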

Paper Structure

This paper contains 28 sections, 3 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Investigation of SHAP analysis usage in recent papers in medical fields. All figures were generated based on the search results available at the time of submission. One article (item 7 in Appendix Table 1) was retracted after submission. The figures and corresponding statistics were retained to preserve transparency of the initial dataset, but this retracted article had no influence on the study’s conclusions.
  • Figure 2: Schematic of the library
  • Figure 3: Graphical explanation of determining the number of important features using SHAP, with examples. a-c. Visual explanation of the feature selection process. d. Feature selection plot derived from the BC dataset. e. Feature selection plot derived from the DR dataset.
  • Figure 4: Flowchart of the univariate SHAP value analysis pipeline. This flow outlines the analysis procedure by feature type (binary, discrete, and continuous), showing how SHAP values are statistically processed and interpreted using hypothesis tests and curve fitting.
  • Figure 5: Examples of univariate analysis. a. Binary type (Dataset: MS), b. Discrete type (Dataset: HF), c. Result of Tukey-HSD test (Feature from Figure 4b), d. Continuous type fitted to a linear function (Dataset: BC), e. Continuous type fitted to a quadratic function (Dataset: DR), f. Continuous type fitted to a sigmoid function (Dataset: MS), g. Continuous type fitted to a sigmoid function (Dataset: IBD). Other examples can be found in Appendix 4.
  • ...and 2 more figures