Beyond Cox Models: Assessing the Performance of Machine-Learning Methods in Non-Proportional Hazards and Non-Linear Survival Analysis

Ivan Rossi, Flavio Sartori, Cesare Rollo, Giovanni Birolo, Piero Fariselli, Tiziana Sanavia

TL;DR

This study investigates survival analysis methods beyond the Cox proportional hazards framework, focusing on non-linear and non-PH models across synthetic and real datasets. It finds that CoxPH remains strong when its assumptions hold, but that non-linear and non-PH approaches such as SurvTRACE, GrBSA, and FastCPH can outperform it under non-PH or non-linear conditions, given adequate sample size. The authors recommend Antolini's concordance index instead of Harrell's for non-PH models, paired with the Brier score to assess calibration, so that the evaluation metrics match the behavior of the model being evaluated. The work also delivers SurvHive, a reproducible benchmarking framework that lets researchers compare methods and select one for a specific survival-analysis task.

Abstract

Survival analysis often relies on Cox models, which assume both linearity and proportional hazards (PH). This study evaluates machine and deep learning methods that relax these constraints, comparing their performance with penalized Cox models on a benchmark of three synthetic and three real datasets. In total, eight different models were tested, including six non-linear models, of which four were also non-PH. Although Cox regression often yielded satisfactory performance, we showed the conditions under which machine and deep learning models can perform better. Indeed, the performance of these methods has often been underestimated due to the improper use of Harrell's concordance index (C-index) instead of more appropriate scores such as Antolini's concordance index, which generalizes the C-index to cases where the PH assumption does not hold. In addition, since models with a high C-index are occasionally badly calibrated, combining Antolini's C-index with the Brier score is useful to assess the overall performance of a survival method. Results on our benchmark data showed that survival prediction should be approached by testing different methods to select the most appropriate one according to sample size, non-linearity and non-PH conditions. To make these tests easy to reproduce on our benchmark data, code and documentation are freely available at https://github.com/compbiomed-unito/survhive.
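
The distinction between the two concordance indices can be illustrated with a minimal sketch. It assumes predictions are given as a matrix of survival probabilities on a discrete time grid. The function names (`harrell_c`, `antolini_c`) and the toy data are hypothetical, and this is not the SurvHive implementation. Harrell's index ranks subjects by a single risk score, whereas Antolini's index compares the predicted survival of both subjects at the time the earlier event occurs, so it remains meaningful when survival curves cross.

```python
import numpy as np

def harrell_c(time, event, risk):
    """Harrell's C-index: a comparable pair (i, j) has an observed event for i
    and time[i] < time[j]; it is concordant when risk[i] > risk[j]."""
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored subjects cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in the score count as half
    return concordant / comparable

def antolini_c(time, event, surv, grid):
    """Antolini's time-dependent concordance. surv[k, m] is the predicted
    S(grid[m] | x_k). A comparable pair is concordant when the subject who
    fails first has the lower predicted survival at its own event time."""
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue
        # index of the last grid point not exceeding time[i] (assumed step function)
        m = max(int(np.searchsorted(grid, time[i], side="right")) - 1, 0)
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if surv[i, m] < surv[j, m]:
                    concordant += 1.0
                elif surv[i, m] == surv[j, m]:
                    concordant += 0.5
    return concordant / comparable

# Toy example with crossing survival curves (made-up numbers).
grid = np.array([1.0, 2.0, 3.0])
time = np.array([1.0, 2.0, 3.0])
event = np.array([1, 1, 1])
surv = np.array([
    [0.5, 0.4, 0.35],  # subject 0, event at t=1
    [0.9, 0.3, 0.10],  # subject 1, event at t=2; curve crosses subject 2's
    [0.8, 0.6, 0.50],  # subject 2, event at t=3
])
risk_at_first_time = 1.0 - surv[:, 0]  # a single-time risk score for Harrell's index

print(harrell_c(time, event, risk_at_first_time))
print(antolini_c(time, event, surv, grid))
```

In this toy case the single-time risk score misorders subjects 1 and 2, so Harrell's index is 2/3, while Antolini's index, which reads each curve at the relevant event time, is 1.0. Under proportional hazards the curves never cross and the two indices agree.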

Paper Structure

This paper contains 17 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: C-index performance across all methods and datasets. Error bars are displayed for both Harrell's (orange bars) and Antolini's (blue bars) C-indices. Specifically, each dot represents the average C-index, while the bars show the C-index range resulting from the 3-fold cross-validation. Methods' assumptions are highlighted with different colors in the horizontal labels.
  • Figure 2: Brier score performance across all methods and datasets. Error bars are displayed for the rescaled Brier score (orange bars) and Antolini's C-index (blue bars). Specifically, each dot represents the average score value, while the bars show the range of each score resulting from the 3-fold cross-validation. The original Brier score $BS$ has been rescaled as $1 - 2BS$ to have the same range and direction as the C-index, allowing an easier visual comparison. Methods' assumptions are highlighted with different colors in the horizontal labels.
  • Figure 3: Performance of methods at different subsamples of the synthetic datasets. The y-axis displays Antolini's C-index, while the x-axis shows the sample size.
  • Figure 4: Time elapsed for hyper-parameter optimization and training of each method. The error bars represent the time elapsed in the three real clinical datasets. The vertical axis scale is logarithmic.
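
The $1 - 2BS$ rescaling mentioned in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's code: the helper names are hypothetical, and the Brier score here is computed at a single time horizon without any correction for censoring, which a real evaluation would need.

```python
import numpy as np

def brier_at(horizon, time, surv_at_horizon):
    """Brier score at one time horizon, ignoring censoring (simplifying
    assumption): mean squared gap between the observed survival status
    at the horizon and the predicted survival probability."""
    alive = (np.asarray(time) > horizon).astype(float)
    return float(np.mean((alive - np.asarray(surv_at_horizon)) ** 2))

def rescale_brier(bs):
    """Map a Brier score so that higher is better, as in the Figure 2 caption."""
    return 1.0 - 2.0 * bs

# A perfect prediction (BS = 0) maps to 1, and an uninformative constant
# prediction of 0.5 (BS = 0.25) maps to 0.5, the C-index of a random ranking.
perfect = brier_at(3.0, [1.0, 5.0], [0.0, 1.0])
uninformative = brier_at(3.0, [1.0, 5.0], [0.5, 0.5])
print(rescale_brier(perfect), rescale_brier(uninformative))
```

Putting both scores on a scale where 1 is ideal and 0.5 corresponds to an uninformative model is what makes the side-by-side bars in the figure directly comparable.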