SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot

Justin Lin, Julia Fukuyama

TL;DR

This work addresses the challenge of interpreting black-box predictions in heterogeneous diseases by tying model explanations to the original features through SHAP values and then clustering samples by their SHAP profiles. It integrates a five-step supervised clustering workflow that combines XGBoost predictions with SHAP explanations, visualization via UMAP, and clustering with HDBSCAN, culminating in a generalized high-dimensional waterfall plot for multi-class SHAP vectors. The authors demonstrate the approach on a controlled simulated dataset and a real-world Alzheimer's disease dataset from ADNI, revealing subgroups within disease statuses and distinct predictive pathways, including APOE4-related substructure and key SHAP drivers such as CDRSB, LDELTOTAL, mPACCdigit, and MMSE. The results highlight the method’s potential to uncover disease heterogeneity and support precision medicine by mapping actionable archetypes to prediction pathways, with broad implications for explainable AI in clinical settings.

Abstract

In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.
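
The idea that two samples can receive the same prediction for different reasons can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_model` is a hypothetical stand-in for the black box, the all-zero `baseline` plays the role of the reference input, Shapley values are computed exactly by enumerating feature coalitions (practical only for a handful of features, and not the tree-based estimator a SHAP library would apply to XGBoost), and grouping by the dominant feature is a crude substitute for clustering the SHAP vectors with HDBSCAN.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical black box: either of the first two features can drive the score.
    return max(x[0], x[1]) + 0.5 * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside a coalition take the baseline value."""
    n = len(x)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


baseline = [0.0, 0.0, 0.0]  # assumed reference input
samples = {
    "A": [1.0, 0.0, 1.0],
    "B": [0.0, 1.0, 1.0],
    "C": [1.0, 0.0, 1.0],
}

# One SHAP vector per sample: one contribution per feature.
shap_profiles = {
    name: shapley_values(toy_model, x, baseline) for name, x in samples.items()
}

# Stand-in for clustering: group samples by the feature with the largest |SHAP value|.
groups = {}
for name, phi in shap_profiles.items():
    dominant = max(range(len(phi)), key=lambda j: abs(phi[j]))
    groups.setdefault(dominant, []).append(name)

for name, phi in shap_profiles.items():
    print(name, toy_model(samples[name]), [round(v, 3) for v in phi])
print(groups)
```

All three samples receive the same score, yet sample B reaches it through the second feature while A and C reach it through the first, so grouping on the SHAP vectors separates them even though grouping on the prediction alone would not. For each sample the contributions sum to the difference between its prediction and the baseline prediction, which is the additivity property that a waterfall plot visualizes.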

Paper Structure

This paper contains 20 sections, 8 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Supervised clustering workflow. The methods we used within each step are parenthesized.
  • Figure 2: Raw data embedded in two dimensions with UMAP and colored according to target class.
  • Figure 3: Absolute response function coefficients and average absolute SHAP values.
  • Figure 4: HDBSCAN clustering of the SHAP values (left). Raw values colored according to the same clustering (right).
  • Figure 5: Waterfall plot of top SHAP values averaged across clusters.
  • ...and 10 more figures