Penalized Generative Variable Selection
Tong Wang, Jian Huang, Shuangge Ma
TL;DR
The paper introduces a penalized two-stage framework for variable selection with high-dimensional predictors, built on conditional Wasserstein GANs. Stage 1 applies a group Lasso penalty to the generator’s first-layer weights to identify important predictors; Stage 2 refines estimation using only the selected subset. The approach is extended to censored survival data via Kaplan-Meier weighting. The authors establish convergence rates and selection consistency under varying dimensionality, and validate the method through simulations and real-data analyses (TCGA-LUAD, MIMIC-III albumin and survival, HIV mutations). By combining distribution-matching estimation with theoretical guarantees, the method targets sparse high-dimensional modeling in biomedical contexts. Its practical value lies in variable selection and distribution-aware estimation under censoring and complex predictor structures, which supports both interpretability and prediction.
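To make Stage 1 concrete, here is a minimal NumPy sketch of a group Lasso penalty on first-layer weights, where each group is the set of weights attached to one predictor. The names `W1`, `lam`, `t`, and `tol`, and all three functions, are hypothetical and chosen for illustration. The group soft-thresholding step is a standard proximal update that produces exact zeros; it is shown as one common option, not as the optimizer used in the paper.

```python
import numpy as np

def group_lasso_penalty(W1, lam):
    """Group Lasso penalty on a first-layer weight matrix.

    W1 has shape (hidden_units, p); column j holds every weight that
    connects predictor j to the first hidden layer (assumed layout).
    """
    col_norms = np.linalg.norm(W1, axis=0)  # one L2 norm per predictor
    return lam * col_norms.sum()

def group_soft_threshold(W1, t):
    """Proximal step for the group Lasso: shrink each column toward zero.

    Columns whose norm is at most t are set exactly to zero.
    """
    col_norms = np.linalg.norm(W1, axis=0)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(col_norms, 1e-12))
    return W1 * scale  # scale broadcasts across rows

def selected_predictors(W1, tol=1e-8):
    """Indices of predictors whose first-layer weight group is nonzero."""
    return np.flatnonzero(np.linalg.norm(W1, axis=0) > tol)

# Toy example: 2 hidden units, 3 predictors.
W1 = np.array([[3.0, 0.0, 0.1],
               [4.0, 0.0, 0.0]])
penalty = group_lasso_penalty(W1, lam=0.5)
W1_shrunk = group_soft_threshold(W1, t=0.2)
kept = selected_predictors(W1_shrunk)
```

In a two-stage scheme of the kind described above, the indices in `kept` would define the predictor subset passed to the Stage 2 refit.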
Abstract
Deep networks are increasingly applied to a wide variety of data, including data with high-dimensional predictors. In such analyses, variable selection may be needed alongside estimation/model building. Many of the existing deep network studies that incorporate variable selection have been limited to methodological and numerical developments. In this study, we consider modeling/estimation using conditional Wasserstein Generative Adversarial Networks. Group Lasso penalization is applied for variable selection, which may improve model estimation/prediction, interpretability, and stability. Going significantly beyond the existing literature, we also consider the analysis of censored survival data. We establish the convergence rate for variable selection while accounting for the approximation error, and obtain a more efficient distribution estimation. Simulations and the analysis of real experimental data demonstrate satisfactory practical utility of the proposed analysis.
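As an illustration of how censored survival data can enter a weighted loss, the sketch below computes Kaplan-Meier (Stute-type) weights, which are the jumps of the Kaplan-Meier estimator at the ordered observed times. The abstract does not give the exact weighting formula, so this construction is an assumption; the function name `kaplan_meier_weights` and its arguments `time` and `event` are hypothetical.

```python
import numpy as np

def kaplan_meier_weights(time, event):
    """Kaplan-Meier (Stute-type) weights for right-censored data.

    time  : observed times (event or censoring time)
    event : 1 if the event was observed, 0 if censored
    Returns one weight per observation, in the original order.
    Censored observations receive weight 0.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    n = len(time)

    # Sort by time; on ties, place observed events before censored ones.
    order = np.lexsort((1 - event, time))
    d = event[order]
    ranks = np.arange(1, n + 1)

    # Factor ((n - j) / (n - j + 1)) ** delta_(j) for each ordered observation.
    factors = ((n - ranks) / (n - ranks + 1.0)) ** d
    # Product over j < i: shift the cumulative product by one position.
    prev_prod = np.concatenate(([1.0], np.cumprod(factors)[:-1]))

    w_sorted = d / (n - ranks + 1.0) * prev_prod

    w = np.empty(n)
    w[order] = w_sorted  # map back to the original ordering
    return w

# Without censoring, every observation gets weight 1/n.
w_full = kaplan_meier_weights([3, 1, 2], [1, 1, 1])
# With the middle observation censored, its mass moves to the later event.
w_cens = kaplan_meier_weights([1, 2, 3], [1, 0, 1])
```

Weights of this kind would typically multiply each observation's contribution to the training objective, so that censored observations are down-weighted to zero and later events carry the redistributed mass.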
