How important are the genes to explain the outcome - the asymmetric Shapley value as an honest importance metric for high-dimensional features

Mark A. van de Wiel, Jeroen Goedhart, Martin Jullum, Kjersti Aas

TL;DR

Asymmetric Shapley values are proposed as a more suitable alternative for quantifying feature importance in a mixed-dimensional prediction model. The focus is on a setting that is particularly relevant in clinical prediction: disease state as a mediating variable for genomic effects, with additional confounders for which the direction of effects may be unknown.

Abstract

In clinical prediction settings, the importance of a high-dimensional feature such as genomics is often assessed by evaluating the change in predictive performance when it is added to a set of traditional clinical variables. This approach is questionable, because it accounts for neither collinearity nor the known directionality of dependencies between variables. We propose asymmetric Shapley values as a more suitable alternative for quantifying feature importance in the context of a mixed-dimensional prediction model. We focus on a setting that is particularly relevant in clinical prediction: disease state as a mediating variable for genomic effects, with additional confounders for which the direction of effects may be unknown. We derive efficient algorithms to compute local and global asymmetric Shapley values for this setting. The former are shown to be very useful for inference, whereas the latter provide interpretation by decomposing any predictive performance metric into contributions of the features. Throughout, we illustrate our framework with a leading example: the prediction of progression-free survival for colorectal cancer patients.
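
To make the difference between symmetric and asymmetric Shapley values concrete, the following is a minimal brute-force sketch, not the efficient algorithms derived in the paper. It averages marginal contributions over feature orderings, and the asymmetric version keeps only orderings in which genes $G$ precede disease state $D$, reflecting the mediation assumption. The function `shapley`, the `precedes` argument, and the toy value function are all hypothetical and chosen purely for illustration.

```python
from itertools import permutations


def shapley(players, value, precedes=()):
    """Average marginal contributions over orderings that respect `precedes`.

    `precedes` holds (a, b) pairs meaning a must appear before b.
    With no pairs this reduces to the ordinary (symmetric) Shapley value.
    """
    orders = [
        p for p in permutations(players)
        if all(p.index(a) < p.index(b) for a, b in precedes)
    ]
    phi = {j: 0.0 for j in players}
    for order in orders:
        seen = frozenset()
        for j in order:
            # Marginal contribution of j given the features already included.
            phi[j] += value(seen | {j}) - value(seen)
            seen = seen | {j}
    return {j: total / len(orders) for j, total in phi.items()}


def toy_value(subset):
    # Hypothetical value function (an assumption, not from the paper):
    # G and D carry the same signal worth 1.0; confounder C adds 0.5.
    shared = 1.0 if {"G", "D"} & subset else 0.0
    return shared + (0.5 if "C" in subset else 0.0)


players = ("G", "D", "C")
symmetric = shapley(players, toy_value)
asymmetric = shapley(players, toy_value, precedes=[("G", "D")])
print("symmetric: ", symmetric)
print("asymmetric:", asymmetric)
```

In this toy game the symmetric version splits the shared signal equally between $G$ and $D$, whereas the asymmetric version assigns it entirely to $G$, because $G$ is always evaluated before its mediator. $C$ is left unconstrained, mirroring confounders with unknown direction of effect.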

Paper Structure (26 sections, 27 equations, 6 figures, 8 tables)


Figures (6)

  • Figure 1: Model for information flow for $G$: gene expression profile, $C$: confounders, $D$: disease state, $Y$: outcome. Right arrows depict unidirectional effects, bidirectional arrows depict correlations. Grey arrows indicate prediction, black arrows dependency between variables. Modules may contain multiple features.
  • Figure 2: Mediation effect. Shapley values for Genes $G$ (y-axis) against true disease state $D$ (x-axis) for the asymmetric (left box) and symmetric (right box) version.
  • Figure 3: Analytical Shapley values for variables $G$ (bottom panels) and $D$ (top panels) as a function of the variable values. Both the symmetric (black) and the asymmetric (orange) versions are depicted. Results are shown for varying correlation strengths $\rho$. Remaining parameter settings: $\beta_1=\beta_2=1.0$, $\beta_3=5.0$, and $\gamma=0.8$. The plot is produced using the R package ggplot2 (https://cran.r-project.org/web/packages/ggplot2/index.html).
  • Figure 4: Shapley scatter plots for 200 test observations for the low-dimensional simulation. Variable on the x-axis, its Shapley value on y-axis. All Shapley values are estimated using SHAP, i.e. marginalisation.
  • Figure 5: Asymmetric Shapley values based on refitting. Scatter plots for 200 test observations for the low-dimensional simulation. Variable on the x-axis, its Shapley value on y-axis.
  • ...and 1 more figure