A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data

Lukas Burk, John Zobolas, Bernd Bischl, Andreas Bender, Marvin N. Wright, Raphael Sonabend

TL;DR

This work asks which survival models are robust on low-dimensional, right-censored data. It presents the first large-scale, neutral benchmark of 18 models on 32 publicly available datasets, tuning each model separately for a discrimination measure (Harrell's C) and a proper scoring rule (the right-censored log-likelihood, RCLL). Models are evaluated on eight metrics covering discrimination, calibration, and overall predictive performance. No method significantly outperforms the Cox Proportional Hazards (Cox PH) model on discrimination measures, while tuned Accelerated Failure Time models achieve significantly better overall predictive performance as measured by RCLL. Among machine learning methods, Oblique Random Survival Forests perform comparably well under discrimination, and Cox-based likelihood boosting does so under overall predictive performance. The study recommends Cox PH as a simple, robust starting point for practitioners in standard settings and provides a reproducible benchmarking framework for future survival-model comparisons.

Abstract

This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on high-dimensional data. Additionally, they may lack appropriate tuning or evaluation procedures, or are qualitative reviews, rather than quantitative comparisons. This comprehensive study aims to fill the gap by neutrally evaluating a broad range of methods and providing generalizable conclusions. We benchmark 18 models, ranging from classical statistical approaches to many common machine learning methods, on 32 publicly available datasets. The benchmark tunes for both a discrimination measure and a proper scoring rule to assess performance in different settings. Evaluating on 8 survival metrics, we assess discrimination, calibration, and overall predictive performance of the tested models. Using discrimination measures, we find that no method significantly outperforms the Cox model. However, (tuned) Accelerated Failure Time models were able to achieve significantly better results with respect to overall predictive performance as measured by the right-censored log-likelihood. Machine learning methods that performed comparably well include Oblique Random Survival Forests under discrimination, and Cox-based likelihood-boosting under overall predictive performance. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for practitioners.
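
To make the discrimination measure concrete, the following is a minimal sketch of Harrell's C for right-censored data. It is an illustration only, not the implementation used in the benchmark: the function name `harrell_c` and its arguments are hypothetical, higher risk scores are assumed to mean shorter expected survival, and pairs with tied observed times are simply skipped, which is a simplification.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    times:       observed times (event or censoring time)
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: higher value = higher predicted risk (assumed convention)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        # Tied times are skipped here for simplicity (an assumption).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event got the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1 means every comparable pair is ranked correctly, and 0.5 corresponds to uninformative (constant or random) predictions. Because only the ordering of predictions matters, the measure says nothing about calibration, which is one reason the benchmark also tunes and evaluates with a proper scoring rule.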

Paper Structure (36 sections, 15 figures, 4 tables)

This paper contains 36 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Critical difference plot comparing models with the CPH reference tuned on Harrell's C (a,b) and RCLL (c,d) and evaluated on Harrell's C (a), RCLL (c) and ISBS (b,d). Superior models (lower ranking scores) are on the left with decreasing performance (higher rank) moving right. Models connected by thick horizontal lines are not significantly different from the baseline when adjusting for multiple comparisons.
  • Figure 2: Boxplots of aggregated scores across all datasets for models tuned and evaluated with RCLL, showing unmodified RCLL scores (a), Explained Residual Variation (ERV) scores (b), and scores scaled such that 0 is equivalent to the Kaplan-Meier (KM) baseline and 1 is achieved by the best model for each dataset and measure (c).
  • Figure 3: Boxplots of raw evaluation scores using discrimination measures for tuning (Harrell's C) and using discrimination measures and ISBS for evaluation.
  • Figure 4: Boxplots of scaled evaluation scores using discrimination measures for tuning (Harrell's C) and using discrimination measures and ISBS for evaluation.
  • Figure 5: Boxplots of raw evaluation scores using scoring rules for tuning (RCLL) and evaluation.
  • ...and 10 more figures
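
The two rescalings named in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the function names are hypothetical, scores are assumed to be losses where lower is better, ERV is assumed to take the common form 1 - score / KM score, and the 0-to-1 scaling is assumed to be linear between the KM baseline and the best model.

```python
def erv(score, km_score):
    """Explained Residual Variation relative to a Kaplan-Meier baseline.

    Assumed form: 1 - score / km_score, for losses where lower is better.
    0 means no improvement over KM; values near 1 mean a large improvement.
    """
    if km_score == 0:
        raise ValueError("KM baseline score must be non-zero")
    return 1.0 - score / km_score


def scale_to_best(score, km_score, best_score):
    """Linear rescaling so that KM maps to 0 and the best model maps to 1.

    Intended to be applied per dataset and per measure (assumed linear).
    """
    if km_score == best_score:
        raise ValueError("KM and best score coincide; scaling is undefined")
    return (km_score - score) / (km_score - best_score)
```

Both transformations put scores from different datasets on a comparable scale, which is what makes boxplots aggregated across datasets readable. A model that performs worse than the KM baseline receives a negative value under either rescaling.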