Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

Emanuele Massa; Anthony Coolen

Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

Emanuele Massa, Anthony Coolen

TL;DR

This paper analyzes the Regularized Maximum Partial Likelihood Estimator (RMPLE) for high-dimensional Cox regression in the proportional regime where $p$ and $n$ grow proportionally. Using a replica method under replica symmetry and a Gaussian covariate design, it derives six coupled RS equations predicting the asymptotic behavior of the RMPLE, including its mean-squared error and support recovery characteristics. It then introduces COX-AMP to compute the RMPLE efficiently and shows how the RS order parameters can be inferred directly from data via the local field, without knowing the data-generating process, with CD offering an alternative data-driven path. The work demonstrates that RS predictions, including a fast cross-validation proxy for generalization error, are accurate in simulations and provides a practical route to apply these ideas to GLMs beyond Cox, while outlining limitations and avenues for future extensions to correlated covariates and more general survival models.

Abstract

We study the asymptotic behaviour of the Regularized Maximum Partial Likelihood Estimator (RMPLE) in the proportional limit, considering an arbitrary convex regularizer and assuming that the covariates $\mathbf{X}_i\in\mathbb{R}^{p}$ follow a multivariate Gaussian law with covariance $\mathbf{I}_p/p$ for each $i=1, \dots, n$. In order to efficiently compute the estimator under investigation, we propose a modified Approximate Message Passing (AMP) algorithm, that we name COX-AMP, and compare its performance with the Coordinate-wise Descent (CD) algorithm, which is taken as reference. By means of the Replica method, we derive a set of six Replica Symmetric (RS) equations that we show to correctly describe the average behaviour of the estimators when the sample size and the number of covariates is large and commensurate. These equations cannot be solved in practice, as the data generating process (that we are trying to estimate) is not known. However, the update equations of COX-AMP suggest the construction of a local field that can in turn be used to accurately estimate all the RS order parameters of the theory \emph{solely from the data}, \emph{without} actually solving the RS equations. We emphasize that this approach can be applied when the estimator is computed via any method and is not restricted to COX-AMP. Once the RS order parameters are estimated, we have access to the amount of signal and noise in the RMPLE, but also its generalization error, directly from the data. Although we focus on the Partial Likelihood objective, we envisage broader application of the methodology proposed here, for instance to GLMs with nuisance parameters, which include some non-proportional hazards models, e.g. Accelerated Failure Time models.

Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

TL;DR

This paper analyzes the Regularized Maximum Partial Likelihood Estimator (RMPLE) for high-dimensional Cox regression in the proportional regime where

and

grow proportionally. Using a replica method under replica symmetry and a Gaussian covariate design, it derives six coupled RS equations predicting the asymptotic behavior of the RMPLE, including its mean-squared error and support recovery characteristics. It then introduces COX-AMP to compute the RMPLE efficiently and shows how the RS order parameters can be inferred directly from data via the local field, without knowing the data-generating process, with CD offering an alternative data-driven path. The work demonstrates that RS predictions, including a fast cross-validation proxy for generalization error, are accurate in simulations and provides a practical route to apply these ideas to GLMs beyond Cox, while outlining limitations and avenues for future extensions to correlated covariates and more general survival models.

Abstract

follow a multivariate Gaussian law with covariance

for each

. In order to efficiently compute the estimator under investigation, we propose a modified Approximate Message Passing (AMP) algorithm, that we name COX-AMP, and compare its performance with the Coordinate-wise Descent (CD) algorithm, which is taken as reference. By means of the Replica method, we derive a set of six Replica Symmetric (RS) equations that we show to correctly describe the average behaviour of the estimators when the sample size and the number of covariates is large and commensurate. These equations cannot be solved in practice, as the data generating process (that we are trying to estimate) is not known. However, the update equations of COX-AMP suggest the construction of a local field that can in turn be used to accurately estimate all the RS order parameters of the theory \emph{solely from the data}, \emph{without} actually solving the RS equations. We emphasize that this approach can be applied when the estimator is computed via any method and is not restricted to COX-AMP. Once the RS order parameters are estimated, we have access to the amount of signal and noise in the RMPLE, but also its generalization error, directly from the data. Although we focus on the Partial Likelihood objective, we envisage broader application of the methodology proposed here, for instance to GLMs with nuisance parameters, which include some non-proportional hazards models, e.g. Accelerated Failure Time models.

Paper Structure (34 sections, 216 equations, 4 figures, 5 algorithms)

This paper contains 34 sections, 216 equations, 4 figures, 5 algorithms.

Introduction
Sharp asymptotics in the proportional regime via the replica method
Estimation of the Replica Symmetric order parameters from the data in the proportional regime
Background on survival analysis and the Cox regression model
Proportional hazards models
Cox partial likelihood
Computing the Regularized Maximum Partial Likelihood Estimator (RMPLE)
Typical behaviour of the RMPLE
Interpretation
Application: approximate cross validation
Numerical solution of the RS equations with known data-generating process
Inferring the RS order parameters via Cox-AMP without solving the RS equations
Inferring the RS order parameters via Cox-CD without solving the RS equations
Conclusion
Replica derivation
...and 19 more sections

Figures (4)

Figure 1: Elbow plot of $\hat{\bbeta}_n(\alpha)$ (above) and Harrell's C-index (below) computed via Cox-AMP in red and Cox-CD in blue. The data are generated from a Log-logistic proportional hazards model $\Lambda_0(t) = \log(1+t^2/2)$ with uniform censoring between $\tau_1 = 1.0$ and $\tau_2 = 2.0$. The associations $\bbeta_0$ are sampled as from a Gauss-Bernoulli distribution $\mathcal{P}_{\beta_0}(x) = (1-\nu) \delta(x) + \nu \exp\{-x^2/2\sigma^2\} / \sqrt{2\pi}\sigma$, with sparsity $\nu = 0.005$ and $\sigma$ adjusted to have $\|\bbeta_0\|^2_2/p = \theta^2_0 = 1.0$. The L1 ratio $\lambda = 0.75$. The L2 relative distance between $\hat{\bbeta}_n(\alpha)$ computed via Cox-AMP and Cox-CD is of the order of $10^{-6}$ for all value of the regularization parameter displayed. The coloured dots are the estimate of the test c - index defined alter in (\ref{['def : CV_C_index']}), obtained via the identity (\ref{['def : tilde_bxi']}), obtained by the replica symmetric theory (using the training set only).
Figure 2: Comparison between the numerical solution of the RS equations against their finite sample counter part $w_n := (\hat{\bbeta}_n'\bbeta_0/\|\bbeta_0\|)/\sqrt{p}$, $v_n := \sqrt{\|\hat{\bbeta}_n^2/p - w_n^2\|}$. Notice that here we use the knowledge of $\bbeta_0$ to compute $w_n, v_n$. Here $\zeta = 2.0, \nu = 0.005$ and $p = 2000$, i.e. $n = 1000$, $s =10$ .
Figure 3: Cox-AMP with elastic net regularization The estimators of $\hat{w}_{\star}, \hat{v}_{\star}, \hat{\tau}_{\star}, w_{\star}, v_{\star}, \tau_{\star}$ summarized by a black errorbar plot against the true values in red, along a regularization path (different values of $\alpha$). The data are generated from a Log-logistic proportional hazards model $\Lambda_0(t) = \log(1+t^2/2)$ with uniform censoring between $\tau_1 = 1.0$ and $\tau_2 = 2.0$. The associations $\bbeta_0$ are sampled as from a Gauss-Bernoulli distribution $\mathcal{P}_{\beta_0}(x) = (1-\nu) \delta(x) + \nu \exp\{-x^2/2\sigma^2\} / \sqrt{2\pi}\sigma$, with sparsity $\nu = 0.005$ and $\sigma$ adjusted to have $\|\bbeta_0\|^2_2/p = \theta^2_0 = 1.0$. The L1 ratio is $\lambda = 0.75$.
Figure 4: Cox-CD with elastic net regularization The estimators of $\hat{w}_{\star}, \hat{v}_{\star}, \hat{\tau}_{\star}, w_{\star}, v_{\star}, \tau_{\star}$ summarized by a black errorbar plot against the true values in red, along a regularization path (different values of $\alpha$). The data are generated from a Log-logistic proportional hazards model $\Lambda_0(t) = \log(1+t^2/2)$ with uniform censoring between $\tau_1 = 1.0$ and $\tau_2 = 2.0$. The associations $\bbeta_0$ are sampled as from a Gauss-Bernoulli distribution $\mathcal{P}_{\beta_0}(x) = (1-\nu) \delta(x) + \nu \exp\{-x^2/2\sigma^2\} / \sqrt{2\pi}\sigma$, with sparsity $\nu = 0.005$ and $\sigma$ adjusted to have $\|\bbeta_0\|^2_2/p = \theta^2_0 = 1.0$. The L1 ratio is $\lambda = 0.75$.

Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

TL;DR

Abstract

Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

Authors

TL;DR

Abstract

Table of Contents

Figures (4)