FastSurvival: Hidden Computational Blessings in Training Cox Proportional Hazards Models

Jiachang Liu, Rui Zhang, Cynthia Rudin

TL;DR

This work proposes new optimization methods by constructing and minimizing surrogate functions that exploit hidden mathematical structures of the Cox proportional hazards model, and shows how these methods can be used to solve the cardinality-constrained CPH problem.

Abstract

Survival analysis is an important research topic with applications in healthcare, business, and manufacturing. One essential tool in this area is the Cox proportional hazards (CPH) model, which is widely used for its interpretability, flexibility, and predictive performance. However, for modern data science challenges such as high dimensionality (both $n$ and $p$) and high feature correlations, current algorithms to train the CPH model have drawbacks, preventing us from using the CPH model to its full potential. The root cause is that the current algorithms, based on the Newton method, have trouble converging due to vanishing second-order derivatives when outside the local region of the minimizer. To circumvent this problem, we propose new optimization methods by constructing and minimizing surrogate functions that exploit hidden mathematical structures of the CPH model. Our new methods are easy to implement and ensure monotonic loss decrease and global convergence. Empirically, we verify the computational efficiency of our methods. As a direct application, we show how our optimization methods can be used to solve the cardinality-constrained CPH problem, producing very sparse high-quality models that were not previously practical to construct. We list several extensions that our breakthrough enables, including optimization opportunities, theoretical questions on CPH's mathematical structure, as well as other CPH-related applications.
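To make the surrogate idea concrete, here is a minimal sketch of coordinate descent on a quadratic surrogate of the $\ell_2$-regularized CPH loss. It is not the authors' implementation. It assumes Breslow-style risk sets, a penalty of the form $\lambda_2 \|\beta\|_2^2$, and a per-coordinate curvature bound obtained from Popoviciu's inequality (variance of a bounded variable is at most a quarter of its squared range); the paper's exact constants and update rules may differ. All function names are hypothetical.

```python
import numpy as np

def risk_sets(time):
    # Row i is a boolean mask of the samples still at risk at time[i].
    return time[None, :] >= time[:, None]

def cph_loss(beta, X, event, R):
    # Negative log partial likelihood, using a stabilized log-sum-exp.
    eta = X @ beta
    loss = 0.0
    for i in np.flatnonzero(event):
        e = eta[R[i]]
        loss += e.max() + np.log(np.exp(e - e.max()).sum()) - eta[i]
    return loss

def coord_gradient(beta, X, event, R, l):
    # Partial derivative w.r.t. beta[l]: softmax-weighted mean of feature l
    # over each risk set, minus the feature value of the subject with the event.
    eta = X @ beta
    g = 0.0
    for i in np.flatnonzero(event):
        w = np.exp(eta[R[i]] - eta[R[i]].max())
        w /= w.sum()
        g += w @ X[R[i], l] - X[i, l]
    return g

def curvature_bounds(X, event, R):
    # Assumed bound: the second partial derivative is a sum of variances,
    # and each variance is at most (max - min)^2 / 4 on its risk set.
    L = np.zeros(X.shape[1])
    for i in np.flatnonzero(event):
        Xi = X[R[i]]
        L += 0.25 * (Xi.max(axis=0) - Xi.min(axis=0)) ** 2
    return L

def quadratic_surrogate_cd(X, time, event, lam2=1.0, sweeps=20):
    R = risk_sets(time)
    L = curvature_bounds(X, event, R)
    beta = np.zeros(X.shape[1])
    obj = lambda b: cph_loss(b, X, event, R) + lam2 * (b @ b)
    history = [obj(beta)]
    for _ in range(sweeps):
        for l in range(X.shape[1]):
            denom = L[l] + 2.0 * lam2
            if denom == 0.0:
                continue  # constant feature and no penalty: nothing to update
            g = coord_gradient(beta, X, event, R, l) + 2.0 * lam2 * beta[l]
            # Exact minimizer of the quadratic upper bound along coordinate l.
            beta[l] -= g / denom
        history.append(obj(beta))
    return beta, history
```

Because each step minimizes an upper bound that touches the objective at the current iterate, the recorded objective values in `history` cannot increase, which is the monotonic-decrease property the abstract refers to. No Hessian is formed or inverted, so each coordinate update needs only first-order information and a precomputed constant.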

Paper Structure

This paper contains 55 sections, 4 theorems, 72 equations, 35 figures, 1 table.

Key Result

Theorem 3.1

For the CPH loss function (the negative log partial likelihood, labeled eq:CPH_neg_log_partial_likelihood_definition in the paper), the theorem gives closed-form expressions for the first-, second-, and third-order partial derivatives with respect to coordinate $l$.
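The following sketch illustrates what such coordinate-wise derivatives look like under the standard form of the negative log partial likelihood, $\sum_{i:\delta_i=1} \big[\log \sum_{j \in R_i} e^{x_j^\top \beta} - x_i^\top \beta\big]$, with Breslow-style risk sets $R_i = \{j : t_j \ge t_i\}$. This form is an assumption, since the paper's own equation is not reproduced here. Under it, the log-sum-exp term is a cumulant generating function, so the three derivatives are sums over events of the softmax-weighted mean (minus $x_{il}$), variance, and third central moment of feature $l$ on each risk set. The function names are hypothetical.

```python
import numpy as np

def cph_loss(beta, X, time, event):
    # Negative log partial likelihood with risk sets {j : time[j] >= time[i]}.
    eta = X @ beta
    loss = 0.0
    for i in np.flatnonzero(event):
        e = eta[time >= time[i]]
        loss += e.max() + np.log(np.exp(e - e.max()).sum()) - eta[i]
    return loss

def coord_derivatives(beta, X, time, event, l):
    # First, second, and third partial derivatives w.r.t. beta[l].
    eta = X @ beta
    d1 = d2 = d3 = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        w = np.exp(eta[at_risk] - eta[at_risk].max())
        w /= w.sum()                  # softmax weights over the risk set
        x = X[at_risk, l]
        mean = w @ x
        centered = x - mean
        d1 += mean - X[i, l]          # weighted mean minus observed value
        d2 += w @ centered ** 2       # weighted variance (never negative)
        d3 += w @ centered ** 3       # weighted third central moment
    return d1, d2, d3
```

A quick sanity check is to compare `coord_derivatives` against central finite differences of `cph_loss` along coordinate `l`. Note that the second derivative is a sum of variances, so it shrinks toward zero whenever the softmax weights concentrate on a single subject in each risk set, which is consistent with the vanishing-curvature issue described in the abstract.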

Figures (35)

  • Figure 1: Efficiency experiments on the first fold of the Flchain dataset. a) The left two plots are on the $\ell_2$-regularized problem with $\lambda_2 = 1$. For all Newton-type methods, the losses blow up when regularization is weak. In contrast, our methods (quadratic and cubic surrogates) ensure monotonic decrease of losses. b) The right two plots are on the $\ell_1 + \ell_2$-regularized problem with $\lambda_1=1$ and $\lambda_2=5$. The exact Newton method cannot be directly applied, so we compare only with quasi Newton [simon2011regularization] and proximal Newton [moufad2023skglm] methods, which have losses that increase at the beginning. Our methods are significantly faster than both baselines. Because the evaluation cost per iteration is very cheap for our methods, we are significantly faster in terms of wall clock time (see the difference between the third and fourth plots). See the appendix (additional results) for results on other datasets.
  • Figure 2: Variable selection on synthetic datasets with high correlation (correlation level $\rho = 0.9$). From left to right, the sample sizes are 1200, 1000, and 800, respectively. The F1 score (the higher the better) is closely related to the support recovery rate. On the left two plots, we can see our method recovers the true variables significantly better than other methods ($100\%$ recovery rate on the left plot; true support size is 15). As the sample size decreases, the F1 score decreases for all methods.
  • Figure 3: Variable selection on the Employee Attrition dataset. We show support size vs. CIndex (left two plots, the higher the better) and support size vs. IBS score (right two plots, the lower the better). We compare our method with Cox-based sparse learning methods. For both metrics, our method is significantly better than other baselines.
  • Figure 4: Variable selection on the Dialysis dataset. We show support size vs. CIndex (left two plots, the higher the better) and support size vs. IBS score (right two plots, the lower the better). We compare our method with other model classes. For both metrics, our method obtains solutions that are significantly sparser than other model classes without losing accuracy on the test sets. Other model classes are prone to overfitting on the training sets.
  • Figure 5: Optimization on the Flchain dataset with $\lambda_1 = 0$ and $\lambda_2 = 1.0$. The losses of all baselines (exact Newton, quasi Newton, and proximal Newton) blow up. In contrast, the losses of our methods, based on the quadratic and cubic surrogate functions, decrease monotonically.
  • ...and 30 more figures

Theorems & Definitions (9)

  • Theorem 3.1
  • Lemma 3.2
  • Corollary 3.3
  • Theorem 3.4
  • proof
  • proof
  • proof
  • proof
  • proof