Interpretable Non-linear Survival Analysis with Evolutionary Symbolic Regression
Luigi Rovito, Marco Virgolin
TL;DR
This paper tackles the interpretability-accuracy trade-off in survival regression (SuR) by introducing an evolutionary Symbolic Regression (SR) framework tailored to SuR. The authors develop a multi-expression SR algorithm based on genetic programming (GP) that fits a Cox-like hazard model $h(t,\mathbf{x})=h_0(t)\exp(\boldsymbol{\theta}^\top \mathbf{f}(\mathbf{x}))$, where $h_0(t)$ is the baseline hazard, $\mathbf{f}(\mathbf{x})$ is a vector of evolved, non-linear expressions, and model dimensionality is controlled to preserve interpretability. They benchmark SR against traditional glass-box methods (Cox with elastic net, Survival Trees) and black-box models (Gradient Boosting, Random Forest) on five real-world datasets, comparing Pareto fronts, hypervolume, and the concordance index with inverse probability of censoring weighting (IPCW). The results show that SR often outperforms the glass-box approaches and is competitive with the black-box models. Qualitative analyses show that SR can yield compact, partially interpretable expressions that combine linear and non-linear terms, which suggests that interpretability is attainable at limited complexity, although some numerical protections used to ensure stability may hamper readability. Overall, the work positions SR as a promising direction for interpretable, non-linear SuR, with future work addressing overfitting, regularization, and fully interpretable expression designs.
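To make the Cox-like model concrete, the following is a minimal sketch of how a linear predictor $\boldsymbol{\theta}^\top \mathbf{f}(\mathbf{x})$ could be scored with the Cox negative partial log-likelihood. It is not the authors' implementation: the expressions in `features`, the function names, and the toy data are all hypothetical, and the likelihood assumes no tied event times.

```python
import numpy as np

def features(X):
    """Hypothetical evolved expressions f(x); illustrative only, not from the paper."""
    x1, x2 = X[:, 0], X[:, 1]
    # One linear term, one protected logarithm, one interaction term.
    return np.column_stack([x1, np.log1p(np.abs(x2)), x1 * x2])

def neg_partial_log_likelihood(theta, F, time, event):
    """Cox negative partial log-likelihood, assuming no tied event times."""
    eta = F @ theta                      # linear predictor theta^T f(x)
    order = np.argsort(-time)            # sort by descending time
    eta_s, event_s = eta[order], event[order]
    # Running log-sum-exp: entry i covers every subject still at risk at time i.
    log_risk = np.logaddexp.accumulate(eta_s)
    return -np.sum((eta_s - log_risk)[event_s == 1])

# Toy data (made up): two covariates, six subjects, event=0 means censored.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
time = np.array([5.0, 3.0, 8.0, 1.0, 6.0, 2.0])
event = np.array([1, 0, 1, 1, 0, 1])

F = features(X)
theta = np.zeros(F.shape[1])
nll = neg_partial_log_likelihood(theta, F, time, event)
```

With `theta` set to zero every subject has the same hazard, so the loss reduces to the sum of the logarithms of the risk-set sizes at each event time. That makes a convenient sanity check before fitting `theta` with any optimizer.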
Abstract
Survival Regression (SuR) is a key technique for modeling time-to-event data in important applications such as clinical trials and semiconductor manufacturing. Currently, SuR algorithms belong to one of three classes: non-linear black-box -- allowing adaptability to many datasets but offering limited interpretability (e.g., tree ensembles); linear glass-box -- being easier to interpret but limited to modeling only linear interactions (e.g., Cox proportional hazards); and non-linear glass-box -- allowing adaptability and interpretability, but empirically found to have several limitations (e.g., explainable boosting machines, survival trees). In this work, we investigate whether Symbolic Regression (SR), i.e., the automated search of mathematical expressions from data, can lead to non-linear glass-box survival models that are interpretable and accurate. We propose an evolutionary, multi-objective, and multi-expression implementation of SR adapted to SuR. Our empirical results on five real-world datasets show that SR consistently outperforms traditional glass-box methods for SuR in terms of accuracy per number of dimensions in the model, while exhibiting comparable accuracy to black-box methods. Furthermore, we offer qualitative examples to assess the interpretability potential of SR models for SuR. Code at: https://github.com/lurovi/SurvivalMultiTree-pyNSGP.
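Comparing "accuracy per number of dimensions" amounts to a two-objective comparison. As a rough sketch, the snippet below extracts the non-dominated models from a list of (number of dimensions, error) pairs and computes the 2-D hypervolume against a reference point. The candidate values, the reference point, and the choice of error as one minus the concordance index are assumptions made for illustration, not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (n_dims, error) pairs; both objectives are minimized."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return sorted(front)

def hypervolume_2d(front, ref):
    """Area dominated by a sorted 2-D front and bounded by the reference point."""
    area = 0.0
    for k, (x, y) in enumerate(front):
        # Each point contributes a rectangle up to the next point's x (or the reference).
        next_x = front[k + 1][0] if k + 1 < len(front) else ref[0]
        area += (next_x - x) * (ref[1] - y)
    return area

# Made-up candidates: (number of dimensions, error), with error assumed to be 1 - C-index.
candidates = [(1, 0.40), (2, 0.35), (2, 0.38), (3, 0.36), (4, 0.30)]
front = pareto_front(candidates)
hv = hypervolume_2d(front, ref=(5, 0.5))
```

A larger hypervolume means the front reaches lower error with fewer dimensions, which is why a single number can summarize the trade-off across methods.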
