Statistical inference after variable selection in Cox models: A simulation study

Lena Schemet, Sarah Friedrich-Welz

TL;DR

The paper addresses valid inference after data-driven variable selection in Cox proportional hazards models with right censoring. It evaluates three post-selection strategies (sample splitting, exact conditional post-selection inference (PSI), and the debiased Lasso) in simulations that include a METABRIC-calibrated setting, and in a real-data example, focusing on selective coverage, interval width, and power. The findings indicate that sample splitting and the debiased Lasso achieve near-nominal selective coverage and balanced power, whereas exact PSI can undercover or produce substantially inflated intervals when the tuning parameter is chosen from the data. The debiased Lasso often yields the shortest and most stable selective confidence intervals (SCIs), although its behavior varies somewhat across covariate types. The study offers practical guidance on tuning choices, on the impact of censoring, and on aligning the inferential target with the analytic goal, and it provides code for reproducibility. Overall, selective inference after the Lasso in survival analysis can improve the credibility of post-selection conclusions, especially when tuning is done carefully and results are interpreted within the selected submodel.

Abstract

Choosing relevant predictors is central to the analysis of biomedical time-to-event data. Classical frequentist inference, however, presumes that the set of covariates is fixed in advance and does not account for data-driven variable selection. As a consequence, naive post-selection inference may be biased and misleading. In right-censored survival settings, these issues may be further exacerbated by the additional uncertainty induced by censoring. We investigate several inference procedures applied after variable selection for the coefficients of the Lasso and its extension, the adaptive Lasso, in the context of the Cox model. The methods considered include sample splitting, exact post-selection inference, and the debiased Lasso. Their performance is examined in a neutral simulation study reflecting realistic covariate structures and censoring rates commonly encountered in biomedical applications. To complement the simulation results, we illustrate the practical behavior of these procedures in an applied example using a publicly available survival dataset.

Paper Structure (36 sections, 11 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Selective coverage under the realistic coefficient pattern with $p=20$, Weibull baseline hazard, no censoring, and correlation $\rho=0.3$. Results are shown for coefficient $X_1$ using the non-adaptive Lasso with tuning choices $\lambda_{\mathrm{CV,min}}$, $\lambda_{\mathrm{CV,1se}}$, and $\lambda_{\mathrm{AIC}}$.
  • Figure 2: Distribution of SCI lengths for the primary estimand at sample size $n=200$ across inference methods. Results are shown for the toy and METABRIC-calibrated settings on a log scale.
  • Figure 3: Selective power (top row) and selective type I error rates (bottom row) for the toy and METABRIC settings at sample size $n=75$. The dashed horizontal line indicates the nominal type I error level.
  • Figure 4: Real-data example (METABRIC): point estimates and 90% selective confidence intervals for regression coefficients obtained with different inference methods. Results are shown for the cross-validated tuning choice $\lambda_{\mathrm{CV,min}}$. Coefficients are displayed on the original scale and ordered by increasing standardized effect size. Numbers above the panels indicate selection frequencies (in %) across 100 subsamples.
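
The quantities named in the figure captions (selective coverage, SCI length, selective power and type I error) are all computed conditionally on a covariate having been selected. The sketch below shows one plausible way to compute them from simulation replicates. It is an illustration under assumptions, not the authors' code: the `Replicate` record layout, the function names, and the choice to count a rejection whenever the interval excludes zero are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Replicate:
    """One simulation replicate for one coefficient (hypothetical layout)."""
    selected: bool   # did the selection step keep this covariate?
    lower: float     # lower interval bound (ignored unless selected)
    upper: float     # upper interval bound (ignored unless selected)
    target: float    # value of the inferential target in this replicate


def _kept(runs: List[Replicate]) -> List[Replicate]:
    # Selective metrics condition on selection: drop unselected replicates.
    return [r for r in runs if r.selected]


def selective_coverage(runs: List[Replicate]) -> Optional[float]:
    """Share of selected replicates whose interval contains the target."""
    kept = _kept(runs)
    if not kept:
        return None  # undefined if the covariate was never selected
    return sum(r.lower <= r.target <= r.upper for r in kept) / len(kept)


def selective_rejection_rate(runs: List[Replicate]) -> Optional[float]:
    """Share of selected replicates whose interval excludes zero.

    Assumed reading: this is selective power when the target is nonzero
    and selective type I error when the target is zero.
    """
    kept = _kept(runs)
    if not kept:
        return None
    return sum(not (r.lower <= 0.0 <= r.upper) for r in kept) / len(kept)


def median_sci_length(runs: List[Replicate]) -> Optional[float]:
    """Median interval length among selected replicates."""
    kept = _kept(runs)
    if not kept:
        return None
    return median(r.upper - r.lower for r in kept)


# Made-up replicates for illustration only (not results from the paper).
runs = [
    Replicate(selected=True, lower=-0.1, upper=0.5, target=0.3),
    Replicate(selected=True, lower=0.4, upper=0.9, target=0.3),
    Replicate(selected=False, lower=0.0, upper=0.0, target=0.3),
    Replicate(selected=True, lower=0.1, upper=0.6, target=0.3),
]
coverage = selective_coverage(runs)
rejection = selective_rejection_rate(runs)
length = median_sci_length(runs)
```

Because every metric discards replicates in which the covariate was not selected, the denominators can differ across methods and tuning choices, which is worth keeping in mind when comparing panels.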