Statistical inference after variable selection in Cox models: A simulation study
Lena Schemet, Sarah Friedrich-Welz
TL;DR
The paper addresses valid inference after data-driven variable selection in Cox proportional hazards models with right censoring. It evaluates three post-selection strategies: sample splitting, exact conditional post-selection inference (PSI), and the debiased Lasso. These are compared across comprehensive simulations (including METABRIC-based scenarios) and a real-data example, with a focus on selective coverage, interval width, and power. Findings show that methods explicitly accounting for selection (sample splitting and debiased Lasso) achieve near-nominal selective coverage and balanced power, while exact PSI can exhibit undercoverage or substantial interval inflation when tuning is data-driven. The debiased Lasso often yields the shortest, most stable SCIs, though with some variability across covariate types. The study offers practical guidance on tuning choices, the impact of censoring, and the importance of aligning inferential targets with the analytic goal, and it provides code for reproducibility. Overall, selective inference after the Lasso in survival analysis can improve the credibility of post-selection conclusions, especially when tuning is done carefully and results are interpreted in the submodel context.
Abstract
Choosing relevant predictors is central to the analysis of biomedical time-to-event data. Classical frequentist inference, however, presumes that the set of covariates is fixed in advance and does not account for data-driven variable selection. As a consequence, naive post-selection inference may be biased and misleading. In right-censored survival settings, these issues may be further exacerbated by the additional uncertainty induced by censoring. We investigate several inference procedures applied after variable selection for the coefficients of the Lasso and its extension, the adaptive Lasso, in the context of the Cox model. The methods considered include sample splitting, exact post-selection inference, and the debiased Lasso. Their performance is examined in a neutral simulation study reflecting realistic covariate structures and censoring rates commonly encountered in biomedical applications. To complement the simulation results, we illustrate the practical behavior of these procedures in an applied example using a publicly available survival dataset.
