Right-censored models on massive data

Gabriela Ciuperca

TL;DR

This work addresses scalable variable selection for right-censored data by partitioning a massive sample into $K$ groups, with $K=o(n)$, and constructing aggregated censored adaptive LASSO estimators. It introduces four estimation methods (median, quantile, expectile, and least squares), each with an adaptive penalty, which recover the true sparsity pattern and are asymptotically normal for the nonzero coefficients, matching the oracle properties of the full-data estimator. A BIC-type criterion guides tuning-parameter selection, enabling practical model selection in large datasets, while the partitioning keeps the estimator of the survival function consistent. Monte Carlo experiments confirm that aggregation reduces computation time without compromising these statistical properties, and show how $K$, $p$, and $w$ influence variable selection performance across methods.

Abstract

This article considers the automatic selection of the relevant explanatory variables in a right-censored model on a massive database. We propose and study four aggregated censored adaptive LASSO estimators, constructed by dividing the observations in such a way as to preserve the consistency of the survival-curve estimator. We show that these estimators have the same theoretical oracle properties as the estimator built on the full database. Moreover, Monte Carlo simulations show that their computation time is shorter than that of the full-database estimator. The simulations also confirm the theoretical properties. For optimal tuning parameter selection, we propose a BIC-type criterion.
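The divide-and-aggregate idea can be illustrated with a deliberately simplified sketch. It is not the paper's estimator: it ignores censoring and the adaptive LASSO penalty, and simply averages ordinary least-squares fits over $K$ random groups. The function name `split_and_average_ls`, the simulated data, and the choices of `n`, `p`, and `K` are all assumptions made for illustration.

```python
import numpy as np

def split_and_average_ls(X, y, K, seed=0):
    """Average ordinary least-squares fits over K random groups.

    Simplified stand-in for aggregation: no censoring, no penalty.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))        # shuffle observation indices
    groups = np.array_split(idx, K)      # K groups of near-equal size
    fits = [np.linalg.lstsq(X[g], y[g], rcond=None)[0] for g in groups]
    return np.mean(fits, axis=0)         # aggregate by averaging

# Hypothetical simulated data with a sparse true coefficient vector.
rng = np.random.default_rng(1)
n, p, K = 10_000, 5, 25
beta0 = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
X = rng.normal(size=(n, p))
y = X @ beta0 + rng.normal(size=n)

beta_agg = split_and_average_ls(X, y, K)            # aggregated fit
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]    # full-data fit
print(np.round(beta_agg, 3))
print(np.round(beta_full, 3))
```

In this toy setting the aggregated and full-data fits should be close to each other, which is the behaviour the oracle-property results formalise for the censored, penalised estimators.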

Paper Structure

This paper contains 14 sections, 4 theorems, 78 equations, 6 figures, 6 tables.

Key Result

Theorem 1

Under assumptions (A1)-(A8), if moreover $K=o(n)$, $w=O(K^{1/2})$, $F_\varepsilon(0)=1/2$, and $(\lambda_n)_{n \in \mathbb{N}}$ satisfies (eln), then: (i) $\lim_{n \rightarrow \infty} \mathbb{P}[ \overset{\vee}{\cal A}_n= {\cal A} ]=1$, ...

Figures (6)

  • Figure 1: The histograms of indices $j \in \{1, \cdots , 20\}$ which minimize the BIC criteria for a tuning parameter of the form $\lambda_n=n^{1/2-1/(10j)}$, for $n=1000$, censored adaptive LASSO expectile and quantile methods.
  • Figure 2: Study of the estimator $\overset{\vee}{\textrm{$\mathbf{\beta}$}}_{n}$ with respect to values of $w$, for $K \in \{25, 125\}$, $p=50$.
  • Figure 3: Study of the estimator $\overset{\vee}{\textrm{$\mathbf{\beta}$}}_{n}$ with respect to values of $K$, for $w \in \{1, 5\}$, $p=50$.
  • Figure 4: The histogram of $Dn1 \equiv {\sqrt n}(\overset{\vee}{\beta}_{n,1}-\beta^0_1)$ by aggregated censored adaptive LASSO methods.
  • Figure 5: Study of false zeros and of $\|(\overset{\vee}{\textrm{$\mathbf{\beta}$}}_{n}-\textrm{$\mathbf{\beta}^0$} )_{\cal A} \|_1$ with respect to the values of $\|\textrm{$\mathbf{\beta}$}^0_{\cal A}\|_1$, when ${\cal A}=\{1\}$ and $\beta^0_1=1/(2j)$, $j \in \{1, \cdots , 20\}$, $K=25$, $w=[\sqrt{K}]$.
  • ...and 1 more figure
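The candidate tuning parameters mentioned in the Figure 1 caption, $\lambda_n=n^{1/2-1/(10j)}$ for $j \in \{1, \cdots , 20\}$ and $n=1000$, can be enumerated directly. The sketch below only builds this grid. It does not evaluate the BIC-type criterion, whose formula is not given here, and the variable names are illustrative.

```python
import math

n = 1000  # sample size used in the Figure 1 caption

# Candidate tuning parameters lambda_n = n^(1/2 - 1/(10 j)), j = 1, ..., 20.
lambdas = {j: n ** (0.5 - 1.0 / (10 * j)) for j in range(1, 21)}

for j, lam in lambdas.items():
    print(f"j={j:2d}  lambda_n={lam:.3f}")

# Every candidate stays below sqrt(n), approaching it as j grows.
print(f"sqrt(n)={math.sqrt(n):.3f}")
```

The grid increases with $j$, since the exponent $1/2-1/(10j)$ rises from $0.4$ towards $1/2$.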

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4