Practical considerations for variable screening in the super learner

Brian D. Williamson, Drew King, Ying Huang

TL;DR

This paper investigates how variable screening affects the performance of the super learner (SL) in high-dimensional prediction tasks. It compares the lasso screener with a diverse library of screeners on simulated data and on an HIV-1 Env prediction problem, using nested cross-validation to assess performance for continuous and binary outcomes. The key finding is that the lasso screener can underperform when outcome-feature relationships are nonlinear or features are correlated, whereas a broad set of screeners generally protects against misspecification and can outperform no-screening approaches. In the HIV data, the discrete SL (dSL) often matches or approaches the convex ensemble SL (cSL) when screening is varied. The paper recommends including a diverse pool of screeners in the SL library and provides code to reproduce the numerical experiments.

Abstract

Estimating a prediction function is a fundamental component of many data analyses. The super learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms (screeners), including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a super learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screeners should be used to protect against poor performance of any one screener, similar to the guidance for choosing a library of prediction algorithms for the super learner. These results are further illustrated through the analysis of HIV-1 antibody data.
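
The idea described above, pairing each candidate screener with each prediction algorithm and letting cross-validation choose among the pairs, can be sketched in a few lines. The following is a minimal standard-library Python illustration, not the authors' implementation. The function names (`make_corr_screener`, `fit_knn`, `discrete_super_learner`), the simulated data, and the choice of a marginal-correlation screener and a k-nearest-neighbour learner are all assumptions made for illustration. It covers only a discrete super learner (select the single screener-learner pair with the best cross-validated R-squared), not the convex ensemble.

```python
import math
import random
import statistics


def pearson(u, v):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv) if suu > 0 and svv > 0 else 0.0


def screen_all(X, y):
    """No screening: keep every feature index."""
    return list(range(len(X[0])))


def make_corr_screener(k):
    """Hypothetical screener: keep the k features most correlated (in absolute value) with y."""
    def screen(X, y):
        p = len(X[0])
        scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
        return sorted(range(p), key=lambda j: -scores[j])[:k]
    return screen


def fit_mean(X, y):
    """Baseline learner: always predict the training mean."""
    mu = statistics.fmean(y)
    return lambda row: mu


def fit_knn(X, y, k=5):
    """k-nearest-neighbour regression using squared Euclidean distance."""
    def predict(row):
        dists = sorted((sum((a - b) ** 2 for a, b in zip(row, xi)), yi)
                       for xi, yi in zip(X, y))
        return statistics.fmean([yi for _, yi in dists[:k]])
    return predict


def cv_r2(screen, fit, X, y, n_folds=5):
    """Cross-validated R-squared; screening is redone inside every training fold."""
    n = len(y)
    preds = [0.0] * n
    for f in range(n_folds):
        train = [i for i in range(n) if i % n_folds != f]
        test = [i for i in range(n) if i % n_folds == f]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        keep = screen(X_tr, y_tr)
        model = fit([[row[j] for j in keep] for row in X_tr], y_tr)
        for i in test:
            preds[i] = model([X[i][j] for j in keep])
    y_bar = statistics.fmean(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


def discrete_super_learner(X, y, screeners, learners):
    """Score every screener-learner pair and return the best pair plus all scores."""
    scores = {(s, l): cv_r2(screeners[s], learners[l], X, y)
              for s in screeners for l in learners}
    return max(scores, key=scores.get), scores


def simulate(n=120, p=10, seed=0):
    """Illustrative data: y depends on feature 0 linearly and feature 1 nonlinearly."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2 * row[0] + row[1] ** 2 + rng.gauss(0, 1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    screeners = {"none": screen_all, "corr_top2": make_corr_screener(2)}
    learners = {"mean": fit_mean, "knn": fit_knn}
    best, scores = discrete_super_learner(X, y, screeners, learners)
    for pair, r2 in sorted(scores.items()):
        print(pair, round(r2, 3))
    print("selected:", best)
```

Because each screener is just another entry in the library, adding a further screener (or the no-screening option) only enlarges the set of candidate pairs that cross-validation compares. Note that screening is repeated within each training fold, so the held-out fold never influences which features are kept.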

Paper Structure (13 sections, 1 equation, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Prediction performance versus sample size $n$, measured using cross-validated R-squared, for predicting a continuous outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
  • Figure 2: Prediction performance versus sample size $n$, measured using cross-validated AUC, for predicting a binary outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
  • Figure 3: Prediction performance versus sample size $n$, measured using cross-validated R-squared, for predicting a continuous outcome. There is a weak relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
  • Figure 4: Prediction performance versus sample size $n$, measured using cross-validated AUC, for predicting a binary outcome. There is a weak relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
  • Figure S1: Prediction performance versus sample size $n$, measured using cross-validated non-negative log likelihood (NN log lik.), for predicting a binary outcome. There is a strong relationship between outcome and features. The top row shows results for correlated features, while the bottom row shows results for uncorrelated features. The left-hand column shows results for a linear outcome-feature relationship, while the right-hand column shows results for a nonlinear outcome-feature relationship. The dashed line denotes the best-possible prediction performance in each setting. Color denotes the variable screeners, while shape denotes the estimator (lasso, convex ensemble super learner [cSL], and discrete super learner [dSL]). Note that the y-axis limits differ between panels.
  • ...and 5 more figures