Practical considerations for variable screening in the super learner
Brian D. Williamson, Drew King, Ying Huang
TL;DR
This paper investigates how variable screening affects the performance of the super learner (SL) in high-dimensional prediction tasks. It compares the lasso screener with a diverse library of screeners on simulated data and on an HIV-1 Env prediction problem, using nested cross-validation to assess performance for continuous and binary outcomes. The key finding is that the lasso screener can underperform when relationships are nonlinear or features are correlated, whereas a broad set of screeners generally protects against misspecification and can outperform approaches without screening. In the HIV data, the discrete SL (dSL) often matches or approaches the convex ensemble SL (cSL) when screening is varied. The work gives practical guidance to include a diverse pool of screeners in the SL library, and it provides code to reproduce the numerical experiments.
Abstract
Estimating a prediction function is a fundamental component of many data analyses. The super learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms (screeners), including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a super learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screeners should be used to protect against poor performance of any one screener, similar to the guidance for choosing a library of prediction algorithms for the super learner. These results are further illustrated through the analysis of HIV-1 antibody data.
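The idea of treating screeners as part of the candidate library can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it pairs three screeners (no screening, marginal-correlation ranking, and a lasso fit by coordinate descent) with a single ordinary least squares learner, and a discrete super learner picks the pair with the lowest cross-validated mean squared error. All function names, the simulated data, and the tuning values (`k=5`, `lam=0.1`, five folds) are assumptions made for illustration. Screening is repeated inside each training fold so that the held-out fold does not influence which variables are kept.

```python
import numpy as np

def screen_all(X, y):
    """No screening: keep every column."""
    return np.arange(X.shape[1])

def screen_corr(X, y, k=5):
    """Keep the k features with the largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = np.abs(Xc.T @ yc) / denom
    return np.sort(np.argsort(corr)[-k:])

def screen_lasso(X, y, lam=0.1, n_iter=200):
    """Keep features with nonzero lasso coefficients (cyclic coordinate descent)."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardized columns
    r = y - y.mean()                                      # residual at beta = 0
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r += Xs[:, j] * beta[j]            # remove feature j from the fit
            rho = Xs[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
            r -= Xs[:, j] * beta[j]            # put updated feature j back
    keep = np.flatnonzero(beta)
    # Fall back to the single best-correlated feature if the lasso keeps nothing.
    return keep if keep.size else screen_corr(X, y, k=1)

def fit_ols(X, y):
    """Least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def cv_risk(X, y, screener, n_folds=5, seed=0):
    """Cross-validated MSE of 'screen, then OLS', screening within each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        keep = screener(X[train_idx], y[train_idx])
        coef = fit_ols(X[train_idx][:, keep], y[train_idx])
        pred = predict_ols(coef, X[test_idx][:, keep])
        sq_err += ((y[test_idx] - pred) ** 2).sum()
    return sq_err / len(y)

def discrete_sl(X, y, screeners):
    """Return the screener name with the lowest CV risk, plus all risks."""
    risks = {name: cv_risk(X, y, s) for name, s in screeners.items()}
    return min(risks, key=risks.get), risks

# Simulated data (an assumption for this sketch): 3 of 50 features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(size=200)

SCREENERS = {"none": screen_all, "corr": screen_corr, "lasso": screen_lasso}
best, risks = discrete_sl(X, y, SCREENERS)
print(best, risks)
```

A full super learner would cross every screener with several prediction algorithms and could form a convex combination of the candidates rather than selecting one. The sketch keeps a single learner so that differences in cross-validated risk reflect the screeners alone.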
