Learning and Naming Subgroups with Exceptional Survival Characteristics

Mhd Jawad Al Rahwanji, Sascha Xu, Nils Philipp Walter, Jilles Vreeken

TL;DR

Sysurv is a fully differentiable, non-parametric method that uses random survival forests to learn individual survival curves. It automatically learns conditions, and how to combine them into inherently interpretable rules, in order to select subgroups with exceptional survival characteristics.

Abstract

In many applications, it is important to identify subpopulations that survive longer or shorter than the rest of the population. In medicine, for example, it allows determining which patients benefit from treatment, and in predictive maintenance, which components are more likely to fail. Existing methods for discovering subgroups with exceptional survival characteristics require restrictive assumptions about the survival model (e.g. proportional hazards), pre-discretized features, and, as they compare average statistics, tend to overlook individual deviations. In this paper, we propose Sysurv, a fully differentiable, non-parametric method that leverages random survival forests to learn individual survival curves, automatically learns conditions and how to combine these into inherently interpretable rules, so as to select subgroups with exceptional survival characteristics. Empirical evaluation on a wide range of datasets and settings, including a case study on cancer data, shows that Sysurv reveals insightful and actionable survival subgroups.

Paper Structure

This paper contains 34 sections, 3 theorems, 16 equations, 7 figures, 4 tables, 5 algorithms.

Key Result

Proposition 3.0

Given two groups $A$ and $B$, selectable by rules $\sigma_A$ and $\sigma_B$, respectively, let $\hat{S}_A(t)$ and $\hat{S}_B(t)$ denote their expected group-level survival at any time $t$, and let $\hat{S}(t\mid\mathbf{x})$ denote individual-level survival. The expected absolute difference in … where $\ell^1_t(\cdot,\cdot)$ is an absolute difference measure, and for brevity we write $s_\circ$ …
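
The role of the absolute value in such a difference measure can be illustrated with a minimal sketch. It assumes a plain trapezoidal integral of the pointwise difference between two survival curves, which is not necessarily the paper's $\ell^1_t$ measure. The curves (`population`, `longer`, `crossing`) and all function names are hypothetical and chosen only for illustration; they are not taken from the paper's data.

```python
import math


def trapezoid(values, times):
    """Integrate sampled values over the time grid with the trapezoidal rule."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


def signed_difference(s_a, s_b, times):
    """Integrated signed difference between two sampled survival curves."""
    return trapezoid([a - b for a, b in zip(s_a, s_b)], times)


def absolute_difference(s_a, s_b, times):
    """Integrated absolute difference between two sampled survival curves."""
    return trapezoid([abs(a - b) for a, b in zip(s_a, s_b)], times)


# Illustrative time grid on [0, 5] (assumption, arbitrary units).
times = [i * 0.01 for i in range(501)]

# Hypothetical reference population: exponential survival.
population = [math.exp(-t) for t in times]

# Hypothetical subgroup that never drops below the population (no crossing).
longer = [math.exp(-0.5 * t) for t in times]

# Hypothetical subgroup that starts above the population and crosses it at t = 1.
crossing = [math.exp(-t * t) for t in times]

for name, curve in [("longer", longer), ("crossing", crossing)]:
    signed = signed_difference(curve, population, times)
    absolute = absolute_difference(curve, population, times)
    print(f"{name}: signed={signed:.3f} absolute={absolute:.3f}")
```

For the non-crossing curve the two quantities coincide. For the crossing curve the positive and negative parts of the signed difference partly cancel, so it reports a smaller deviation than the absolute difference does.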

Figures (7)

  • Figure 1: Survival subgroups. Sysurv finds and characterizes survival subgroups, i.e. subpopulations with exceptional survival characteristics compared to the overall population. (Left) Sysurv finds patients suffering from a therapy-resistant tumor. (Right) Sysurv finds people with exceptionally long (1) resp. short (2) durations until re-employment.
  • Figure 2: Desired level of sensitivity. (Left) Individual survival functions $\hat{S}(t|\mathbf{x})$ are more informative for discovery than their group-level estimate $\hat{S}(t)$. (Right) Approaches that do not apply the absolute value can capture the difference between group and reference population survival (Pop.) when the two do not cross ($\hat{S}_1$ and $\hat{S}_2$) but completely underestimate it when they do ($\hat{S}_3$).
  • Figure 3: Soft conditions and rules. The soft condition approaches a hard interval with decreasing temperature $\tau\to 0$ (a). Multiple soft conditions combine to form a soft rule depicted as a hyper box in the covariate space (b). Adapted from xu:2024:syflow.
  • Figure 4: Synthetic setting. Comparison of Sysurv with RuleKit, EsmamDs, and Fibers in terms of F1-scores for recovering planted subgroups with increasingly large datasets (Left), increasingly censored subjects (Center), and increasingly large planted subgroups (Right). EsmamDs is the closest competitor to Sysurv, closely followed by RuleKit. Higher is better. The shaded areas show ±1 standard error over 10 runs.
  • Figure 5: Real-world setting. Survival subgroups discovered in the unemployment (a) and heart attack (b) datasets using Sysurv and RuleKit resp. EsmamDs. Sysurv learns more exceptional subsets of the subgroups discovered by baselines. The shaded areas show 95% confidence intervals.
  • ...and 2 more figures
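
The soft conditions and rules described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's definition: the soft condition is modeled as a product of two sigmoids with temperature `tau`, and the soft rule as a plain product of conditions acting as a soft AND. The function names, the two features, and the bounds are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_condition(x, lower, upper, tau):
    """Soft membership of x in [lower, upper].

    Assumed form: product of two sigmoids. As tau decreases towards 0,
    the membership approaches the hard indicator of the interval.
    """
    return sigmoid((x - lower) / tau) * sigmoid((upper - x) / tau)


def soft_rule(point, bounds, tau):
    """Soft membership of a point in the hyper box given by per-feature bounds.

    Assumed combination: product of the per-feature soft conditions.
    """
    membership = 1.0
    for x, (lower, upper) in zip(point, bounds):
        membership *= soft_condition(x, lower, upper, tau)
    return membership


# Hypothetical rule over two features: 40 <= x1 <= 60 and 1.0 <= x2 <= 2.5.
bounds = [(40.0, 60.0), (1.0, 2.5)]
inside = (50.0, 2.0)
outside = (70.0, 2.0)

for tau in (10.0, 1.0, 0.01):
    print(
        f"tau={tau}: inside={soft_rule(inside, bounds, tau):.4f} "
        f"outside={soft_rule(outside, bounds, tau):.4f}"
    )
```

At a high temperature the membership is graded, which keeps it differentiable with respect to the bounds. At a low temperature it becomes close to 1 inside the box and close to 0 outside, so the learned bounds can be read off as a hard, interpretable rule.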

Theorems & Definitions (5)

  • Proposition 3.0
  • Corollary 3.1
  • proof
  • Proposition 1.0
  • proof