CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity
Kyungjin Kim, Youngro Lee, Jongmo Seo
TL;DR
CLE-SH addresses the lack of statistical validation in SHAP analyses of biomedical tabular data with an automated, statistically driven pipeline that unifies feature selection, feature typing, univariate analysis, and interaction analysis, and produces printable reports. The method combines traditional significance testing with regression-based function fitting (linear, quadratic, sigmoid) and interaction-aware statistics to produce interpretable, sentence-based summaries, which adds rigor and accessibility to early-stage analyses. Key contributions include a lightweight, end-to-end library, explicit hyperparameters for significance control, and ready-to-report outputs that ease collaboration between clinicians and data scientists. The work aims to make SHAP interpretations more reliable, which may improve biomarker discovery and result validation in biomedical research.
Abstract
SHapley Additive exPlanations (SHAP) has recently been adopted across a wide range of research domains. This is particularly evident in applied fields, where SHAP analysis serves as a crucial tool for identifying biomarkers and assisting in result validation. Despite this frequent use, SHAP is often not applied in a way that realizes its full potential. A review of recent papers employing SHAP reveals that many studies subjectively select a limited number of features as 'important' and analyze SHAP values by visually inspecting plots, without assessing statistical significance. Such superficial application may hinder meaningful contributions to the applied fields. To address this, we propose a library package designed to simplify the interpretation of SHAP values. Given only the original data and the SHAP values, our library provides: 1) the number of important features to analyze, 2) the pattern of each feature via univariate analysis, and 3) the interactions between features. All information is extracted on the basis of statistical significance and presented in simple, comprehensible sentences, so that users at any level of expertise can understand the interpretations. We hope this library fosters a comprehensive understanding of statistically valid SHAP results.
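To make the univariate step concrete, the following is a minimal sketch of how a feature's SHAP values might be matched to a linear, quadratic, or sigmoid pattern and reported as a sentence. It is not the CLE-SH API: the function names (`least_squares`, `sigmoid_rss`, `best_shape`), the coarse grid used for the sigmoid parameters, the use of BIC to compare candidates, and the synthetic data are all assumptions made for illustration, and the feature is assumed to be roughly standardized.

```python
import math
import random


def least_squares(columns, y):
    """Fit y ~ sum_j b_j * columns[j] via the normal equations; return the RSS."""
    p, n = len(columns), len(y)
    # Augmented normal-equation matrix [A^T A | A^T y].
    m = [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(a * b for a, b in zip(columns[i], y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, p), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(p):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [a - f * b for a, b in zip(m[r], m[i])]
    coefs = [m[i][p] / m[i][i] for i in range(p)]
    fitted = [sum(c * col[t] for c, col in zip(coefs, columns)) for t in range(n)]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))


def sigmoid_rss(x, y):
    """Best RSS of y ~ c + a * sigmoid(k * (x - x0)) over a coarse (k, x0) grid."""
    xs = sorted(x)
    # Candidate midpoints: the 10th to 90th percentiles of the feature values.
    midpoints = [xs[int(q / 10 * (len(xs) - 1))] for q in range(1, 10)]
    ones = [1.0] * len(x)
    best = math.inf
    for k in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0):  # assumes a roughly standardized x
        for x0 in midpoints:
            # tanh form of the logistic function; avoids overflow in exp().
            s = [0.5 * (1.0 + math.tanh(0.5 * k * (xi - x0))) for xi in x]
            # For fixed (k, x0) the model is linear in the offset and amplitude.
            best = min(best, least_squares([ones, s], y))
    return best


def best_shape(x, shap):
    """Return the candidate shape with the lowest BIC (an assumed criterion)."""
    n = len(x)
    ones = [1.0] * n
    fits = {  # name -> (residual sum of squares, number of parameters)
        "linear": (least_squares([ones, x], shap), 2),
        "quadratic": (least_squares([ones, x, [xi * xi for xi in x]], shap), 3),
        "sigmoid": (sigmoid_rss(x, shap), 4),
    }
    bic = {name: n * math.log(max(rss, 1e-12) / n) + k * math.log(n)
           for name, (rss, k) in fits.items()}
    return min(bic, key=bic.get)


# Synthetic example: SHAP values that saturate as the feature increases.
rng = random.Random(0)
x = [rng.uniform(-3, 3) for _ in range(200)]
shap = [1 / (1 + math.exp(-3 * xi)) - 0.5 + rng.gauss(0, 0.05) for xi in x]
print(f"The SHAP values of this feature follow a {best_shape(x, shap)} pattern.")
```

The grid search keeps the sketch dependency-free and deterministic; in practice a nonlinear least-squares routine would be the more usual choice for the sigmoid fit, and the selection criterion would follow whatever significance thresholds the package exposes as hyperparameters.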
