Right-censored models on massive data
Gabriela Ciuperca
TL;DR
This work addresses scalable variable selection for right-censored data by partitioning a massive sample into $K$ groups, with $K=o(n)$, and constructing aggregated censored adaptive LASSO estimators. It introduces four estimation methods: median, quantile, expectile, and least squares (LS). Each uses an adaptive penalty to recover the true sparsity pattern and to achieve asymptotic normality for the nonzero coefficients, matching the oracle properties of the full-data estimator. A BIC-type criterion guides tuning-parameter selection, enabling practical model selection in large datasets, while the observations are divided so that the estimator of the survival function remains consistent. Monte Carlo experiments confirm that aggregation substantially reduces computation time without compromising statistical properties, and show how $K$, $p$, and $w$ influence variable selection performance across methods.
Abstract
This article considers the problem of automatically selecting the relevant explanatory variables in a right-censored model on a massive database. We propose and study four aggregated censored adaptive LASSO estimators, constructed by dividing the observations in such a way that the estimator of the survival curve remains consistent. We show that these estimators have the same theoretical oracle properties as the estimator built on the full database. Moreover, Monte Carlo simulations show that their computation time is shorter than that of the estimator computed on the full database. The simulations also confirm the theoretical properties. For optimal tuning-parameter selection, we propose a BIC-type criterion.
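To make the split-and-aggregate idea concrete, the following is a minimal sketch under several assumptions that are not taken from the article: the data are uncensored (so no censoring adjustment appears), only the least-squares loss is used, the $K$ groups are formed by a random partition, and the group-wise estimates are combined by a simple average. The function names (`adaptive_lasso_ls`, `aggregated_adaptive_lasso`), the simulated data, and the value of the tuning parameter `lam` are all hypothetical; the article's estimators, partition rule, and aggregation rule may differ.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def adaptive_lasso_ls(X, y, lam, gamma=1.0, n_iter=100):
    """Adaptive LASSO for uncensored least squares, by coordinate descent.

    Minimizes (1/2n)||y - X b||^2 + lam * sum_j w_j |b_j|, with adaptive
    weights w_j = 1 / |pilot_j|^gamma from an ordinary least-squares pilot fit.
    Assumption: no censoring, so no survival-curve weighting is applied.
    """
    n, p = X.shape
    pilot, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / (np.abs(pilot) ** gamma + 1e-8)  # small constant avoids division by zero
    col_sq = (X ** 2).sum(axis=0) / n
    beta = pilot.copy()
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the contribution of coordinate j added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam * w[j]) / col_sq[j]
    return beta

def aggregated_adaptive_lasso(X, y, K, lam, seed=0):
    """Fit the adaptive LASSO on K random groups and average the estimates.

    Assumption: random partition and plain averaging, chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(len(y)), K)
    fits = np.array([adaptive_lasso_ls(X[g], y[g], lam) for g in groups])
    return fits.mean(axis=0)

# Hypothetical simulated data: three relevant and three irrelevant variables.
rng = np.random.default_rng(1)
n, p, K = 6000, 6, 10
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

beta_hat = aggregated_adaptive_lasso(X, y, K=K, lam=0.05)
print(np.round(beta_hat, 3))
```

Each group fit works on only $n/K$ observations, which is where the reduction in computation time would come from; the group fits are also independent of one another and could be run in parallel.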
