Table of Contents
Fetching ...

Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

Emanuele Massa, Anthony Coolen

TL;DR

This paper analyzes the Regularized Maximum Partial Likelihood Estimator (RMPLE) for high-dimensional Cox regression in the proportional regime where $p$ and $n$ grow proportionally. Using a replica method under replica symmetry and a Gaussian covariate design, it derives six coupled RS equations predicting the asymptotic behavior of the RMPLE, including its mean-squared error and support recovery characteristics. It then introduces COX-AMP to compute the RMPLE efficiently and shows how the RS order parameters can be inferred directly from data via the local field, without knowing the data-generating process, with CD offering an alternative data-driven path. The work demonstrates that RS predictions, including a fast cross-validation proxy for generalization error, are accurate in simulations and provides a practical route to apply these ideas to GLMs beyond Cox, while outlining limitations and avenues for future extensions to correlated covariates and more general survival models.

Abstract

We study the asymptotic behaviour of the Regularized Maximum Partial Likelihood Estimator (RMPLE) in the proportional limit, considering an arbitrary convex regularizer and assuming that the covariates $\mathbf{X}_i\in\mathbb{R}^{p}$ follow a multivariate Gaussian law with covariance $\mathbf{I}_p/p$ for each $i=1, \dots, n$. In order to efficiently compute the estimator under investigation, we propose a modified Approximate Message Passing (AMP) algorithm, that we name COX-AMP, and compare its performance with the Coordinate-wise Descent (CD) algorithm, which is taken as reference. By means of the Replica method, we derive a set of six Replica Symmetric (RS) equations that we show to correctly describe the average behaviour of the estimators when the sample size and the number of covariates is large and commensurate. These equations cannot be solved in practice, as the data generating process (that we are trying to estimate) is not known. However, the update equations of COX-AMP suggest the construction of a local field that can in turn be used to accurately estimate all the RS order parameters of the theory \emph{solely from the data}, \emph{without} actually solving the RS equations. We emphasize that this approach can be applied when the estimator is computed via any method and is not restricted to COX-AMP. Once the RS order parameters are estimated, we have access to the amount of signal and noise in the RMPLE, but also its generalization error, directly from the data. Although we focus on the Partial Likelihood objective, we envisage broader application of the methodology proposed here, for instance to GLMs with nuisance parameters, which include some non-proportional hazards models, e.g. Accelerated Failure Time models.

Observable asymptotics of regularized Cox regression models with standard Gaussian designs: a statistical mechanics approach

TL;DR

This paper analyzes the Regularized Maximum Partial Likelihood Estimator (RMPLE) for high-dimensional Cox regression in the proportional regime where and grow proportionally. Using a replica method under replica symmetry and a Gaussian covariate design, it derives six coupled RS equations predicting the asymptotic behavior of the RMPLE, including its mean-squared error and support recovery characteristics. It then introduces COX-AMP to compute the RMPLE efficiently and shows how the RS order parameters can be inferred directly from data via the local field, without knowing the data-generating process, with CD offering an alternative data-driven path. The work demonstrates that RS predictions, including a fast cross-validation proxy for generalization error, are accurate in simulations and provides a practical route to apply these ideas to GLMs beyond Cox, while outlining limitations and avenues for future extensions to correlated covariates and more general survival models.

Abstract

We study the asymptotic behaviour of the Regularized Maximum Partial Likelihood Estimator (RMPLE) in the proportional limit, considering an arbitrary convex regularizer and assuming that the covariates follow a multivariate Gaussian law with covariance for each . In order to efficiently compute the estimator under investigation, we propose a modified Approximate Message Passing (AMP) algorithm, that we name COX-AMP, and compare its performance with the Coordinate-wise Descent (CD) algorithm, which is taken as reference. By means of the Replica method, we derive a set of six Replica Symmetric (RS) equations that we show to correctly describe the average behaviour of the estimators when the sample size and the number of covariates is large and commensurate. These equations cannot be solved in practice, as the data generating process (that we are trying to estimate) is not known. However, the update equations of COX-AMP suggest the construction of a local field that can in turn be used to accurately estimate all the RS order parameters of the theory \emph{solely from the data}, \emph{without} actually solving the RS equations. We emphasize that this approach can be applied when the estimator is computed via any method and is not restricted to COX-AMP. Once the RS order parameters are estimated, we have access to the amount of signal and noise in the RMPLE, but also its generalization error, directly from the data. Although we focus on the Partial Likelihood objective, we envisage broader application of the methodology proposed here, for instance to GLMs with nuisance parameters, which include some non-proportional hazards models, e.g. Accelerated Failure Time models.
Paper Structure (34 sections, 216 equations, 4 figures, 5 algorithms)

This paper contains 34 sections, 216 equations, 4 figures, 5 algorithms.

Figures (4)

  • Figure 1: Elbow plot of $\hat{\bbeta}_n(\alpha)$ (above) and Harrell's C-index (below) computed via Cox-AMP in red and Cox-CD in blue. The data are generated from a Log-logistic proportional hazards model $\Lambda_0(t) = \log(1+t^2/2)$ with uniform censoring between $\tau_1 = 1.0$ and $\tau_2 = 2.0$. The associations $\bbeta_0$ are sampled as from a Gauss-Bernoulli distribution $\mathcal{P}_{\beta_0}(x) = (1-\nu) \delta(x) + \nu \exp\{-x^2/2\sigma^2\} / \sqrt{2\pi}\sigma$, with sparsity $\nu = 0.005$ and $\sigma$ adjusted to have $\|\bbeta_0\|^2_2/p = \theta^2_0 = 1.0$. The L1 ratio $\lambda = 0.75$. The L2 relative distance between $\hat{\bbeta}_n(\alpha)$ computed via Cox-AMP and Cox-CD is of the order of $10^{-6}$ for all value of the regularization parameter displayed. The coloured dots are the estimate of the test c - index defined alter in (\ref{['def : CV_C_index']}), obtained via the identity (\ref{['def : tilde_bxi']}), obtained by the replica symmetric theory (using the training set only).
  • Figure 2: Comparison between the numerical solution of the RS equations against their finite sample counter part $w_n := (\hat{\bbeta}_n'\bbeta_0/\|\bbeta_0\|)/\sqrt{p}$, $v_n := \sqrt{\|\hat{\bbeta}_n^2/p - w_n^2\|}$. Notice that here we use the knowledge of $\bbeta_0$ to compute $w_n, v_n$. Here $\zeta = 2.0, \nu = 0.005$ and $p = 2000$, i.e. $n = 1000$, $s =10$ .
  • Figure 3: Cox-AMP with elastic net regularization The estimators of $\hat{w}_{\star}, \hat{v}_{\star}, \hat{\tau}_{\star}, w_{\star}, v_{\star}, \tau_{\star}$ summarized by a black errorbar plot against the true values in red, along a regularization path (different values of $\alpha$). The data are generated from a Log-logistic proportional hazards model $\Lambda_0(t) = \log(1+t^2/2)$ with uniform censoring between $\tau_1 = 1.0$ and $\tau_2 = 2.0$. The associations $\bbeta_0$ are sampled as from a Gauss-Bernoulli distribution $\mathcal{P}_{\beta_0}(x) = (1-\nu) \delta(x) + \nu \exp\{-x^2/2\sigma^2\} / \sqrt{2\pi}\sigma$, with sparsity $\nu = 0.005$ and $\sigma$ adjusted to have $\|\bbeta_0\|^2_2/p = \theta^2_0 = 1.0$. The L1 ratio is $\lambda = 0.75$.
  • Figure 4: Cox-CD with elastic net regularization The estimators of $\hat{w}_{\star}, \hat{v}_{\star}, \hat{\tau}_{\star}, w_{\star}, v_{\star}, \tau_{\star}$ summarized by a black errorbar plot against the true values in red, along a regularization path (different values of $\alpha$). The data are generated from a Log-logistic proportional hazards model $\Lambda_0(t) = \log(1+t^2/2)$ with uniform censoring between $\tau_1 = 1.0$ and $\tau_2 = 2.0$. The associations $\bbeta_0$ are sampled as from a Gauss-Bernoulli distribution $\mathcal{P}_{\beta_0}(x) = (1-\nu) \delta(x) + \nu \exp\{-x^2/2\sigma^2\} / \sqrt{2\pi}\sigma$, with sparsity $\nu = 0.005$ and $\sigma$ adjusted to have $\|\bbeta_0\|^2_2/p = \theta^2_0 = 1.0$. The L1 ratio is $\lambda = 0.75$.