CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity

Kyungjin Kim; Youngro Lee; Jongmo Seo

TL;DR

CLE-SH tackles the lack of statistical validation in SHAP analyses for biomedical tabular data by delivering an automated, statistically driven pipeline that unifies feature selection, feature typing, univariate analysis, and interaction analysis with printable reports. The method combines traditional significance testing with regression-based function fitting (linear, quadratic, sigmoid) and interaction-aware statistics to produce interpretable, sentence-based summaries, enhancing rigor and accessibility in early-stage analyses. Key contributions include a lightweight, end-to-end library, explicit hyperparameters for significance control, and ready-to-report outputs that facilitate collaboration between clinicians and data scientists. The work demonstrates practical impact by enabling more reliable SHAP interpretations, potentially improving biomarker discovery and result validation in biomedical research.
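The regression-based function fitting mentioned above can be illustrated with a small sketch. This is not the CLE-SH implementation: the function names (`fit_poly`, `adjusted_r2`, `choose_shape`), the adjusted-R² comparison, and the `margin` threshold are all assumptions made for illustration, and the sigmoid candidate is left out to keep the example dependency-free. It fits a line and a parabola to (feature value, SHAP value) pairs by least squares and keeps the simpler shape unless the quadratic fit is clearly better.

```python
def fit_poly(x, y, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    n = degree + 1
    # Augmented matrix [A | b] with A[j][k] = sum(x^(j+k)) and b[j] = sum(x^j * y).
    m = [[sum(v ** (j + k) for v in x) for k in range(n)]
         + [sum((v ** j) * w for v, w in zip(x, y))] for j in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[j][n] / m[j][j] for j in range(n)]


def predict(coeffs, v):
    """Evaluate the fitted polynomial at v."""
    return sum(c * v ** k for k, c in enumerate(coeffs))


def adjusted_r2(x, y, coeffs):
    """Adjusted R^2, which penalizes the extra parameter of the quadratic."""
    n, p = len(y), len(coeffs) - 1
    mean_y = sum(y) / n
    ss_res = sum((w - predict(coeffs, v)) ** 2 for v, w in zip(x, y))
    ss_tot = sum((w - mean_y) ** 2 for w in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def choose_shape(x, y, margin=0.01):
    """Hypothetical rule: prefer 'linear' unless 'quadratic' wins by `margin`."""
    lin = adjusted_r2(x, y, fit_poly(x, y, 1))
    quad = adjusted_r2(x, y, fit_poly(x, y, 2))
    return ("quadratic", quad) if quad - lin > margin else ("linear", lin)


# Usage: feature values and their SHAP values for one continuous feature.
feature = [-3, -2, -1, 0, 1, 2, 3]
shap_vals = [9, 4, 1, 0, 1, 4, 9]
shape, score = choose_shape(feature, shap_vals)
```

A margin on adjusted R² is used here because a quadratic always fits at least as well as a line on raw R²; the criterion the library actually applies may differ.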

Abstract

Recently, SHapley Additive exPlanations (SHAP) has been widely utilized in various research domains. This is particularly evident in application fields, where SHAP analysis serves as a crucial tool for identifying biomarkers and assisting in result validation. However, despite its frequent usage, SHAP is often not applied in a manner that maximizes its potential contributions. A review of recent papers employing SHAP reveals that many studies subjectively select a limited number of features as 'important' and analyze SHAP values by visually inspecting plots, without assessing statistical significance. Such superficial application may hinder meaningful contributions to the applied fields. To address this, we propose a library package designed to simplify the interpretation of SHAP values. Given only the original data and the SHAP values as input, our library provides: 1) the number of important features to analyze, 2) the pattern of each feature via univariate analysis, and 3) the interaction between features. All information is extracted based on its statistical significance and presented in simple, comprehensible sentences, enabling users of all levels to understand the interpretations. We hope this library fosters a comprehensive understanding of statistically valid SHAP results.
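To make the idea of a statistically backed, sentence-based explanation concrete, here is a minimal sketch for a single binary feature. It is an illustration only: `permutation_p_value` and `describe_binary_feature` are hypothetical names, the wording of the sentences is invented, and a permutation test on the difference of means stands in for whichever two-group test the library actually uses.

```python
import random


def permutation_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of group means (stand-in test)."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def describe_binary_feature(name, feature_values, shap_values, alpha=0.05):
    """Return a one-sentence summary of SHAP values for a 0/1 feature."""
    shap_0 = [s for f, s in zip(feature_values, shap_values) if f == 0]
    shap_1 = [s for f, s in zip(feature_values, shap_values) if f == 1]
    p = permutation_p_value(shap_0, shap_1)
    if p >= alpha:
        return (f"SHAP values of '{name}' do not differ significantly "
                f"between the two groups (p = {p:.3f}).")
    mean_0 = sum(shap_0) / len(shap_0)
    mean_1 = sum(shap_1) / len(shap_1)
    direction = "higher" if mean_1 > mean_0 else "lower"
    return (f"When '{name}' is 1, SHAP values are significantly {direction} "
            f"than when it is 0 (p = {p:.3f}).")


# Usage with made-up values: ten samples per group, clearly separated.
feature = [0] * 10 + [1] * 10
shap_vals = [-0.5 + 0.01 * i for i in range(10)] + [0.5 + 0.01 * i for i in range(10)]
sentence = describe_binary_feature("smoker", feature, shap_vals)
```

The significance level `alpha` plays the role of the explicit hyperparameter for significance control described in the TL;DR; its default here is an assumption.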

Paper Structure (28 sections, 3 equations, 7 figures, 2 tables)

Introduction
Background
Problem Definition
Related Works
SHAP Values for Feature Selection
Analysis on SHAP Values
Simpler Explanation of SHAP Values
Proposed Approach
Methods
Datasets
Statistics
Sign of SHAP Values
SHAP Values Between Two Groups
SHAP Values Between Multiple Groups
Regression Analysis
...and 13 more sections

Figures (7)

Figure 1: Investigation of SHAP analysis usage in recent papers in medical fields. All figures were generated based on the search results available at the time of submission. One article (item 7 in Appendix Table 1) was retracted after submission. The figures and corresponding statistics were retained to preserve transparency of the initial dataset, but this retracted article had no influence on the study’s conclusions.
Figure 2: Schematic of the library
Figure 3: Graphical explanation of determining the number of important features using SHAP with examples. a-c. Visual explanation of the feature selection process. d. Feature selection plot derived from the BC dataset. e. Feature selection plot derived from the DR dataset.
Figure 4: Flowchart of the univariate SHAP value analysis pipeline. This flow outlines the analysis procedure by feature type (binary, discrete, and continuous), showing how SHAP values are statistically processed and interpreted using hypothesis tests and curve fitting.
Figure 5: Examples in univariate analysis. a. Binary type (Dataset: MS), b. Discrete type (Dataset: HF), c. Result of Tukey-HSD test (Feature from Figure 4b), d. Continuous type fitted to a linear function (Dataset: BC), e. Continuous type fitted to a quadratic function (Dataset: DR), f. Continuous type fitted to a sigmoid function (Dataset: MS), g. Continuous type fitted to a sigmoid function (Dataset: IBD). Other examples can be found in Appendix 4.
...and 2 more figures
