
MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation

Robert Kuchen

TL;DR

This work tackles variable selection after multiple imputation for prediction, where pooling across imputations can be unstable or overly complex. It introduces MIBoost, a uniform base‑learner selection mechanism within component‑wise gradient boosting that operates across $M$ imputed datasets and yields a single, averaged predictor. In simulations, MIBoost achieves predictive performance comparable to recent unified loss approaches (SaLASSO, SaENET) while providing similar or reduced model complexity relative to estimate averaging, highlighting improvements in stability and interpretability. The method, implemented in the R package booami, offers a practical, scalable solution for reliable variable selection after multiple imputation with gradient boosting.

Abstract

Statistical learning methods for automated variable selection, such as LASSO, elastic nets, or gradient boosting, have become increasingly popular tools for building powerful prediction models. Yet, in practice, analyses are often complicated by missing data. The most widely used approach to address missingness is multiple imputation, which involves creating several completed datasets. However, there is an ongoing debate on how to perform model selection in the presence of multiple imputed datasets. Simple strategies, such as pooling models across datasets, have been shown to have suboptimal properties. Although more sophisticated methods exist, they are often difficult to implement and therefore not widely applied. In contrast, two recent approaches modify the regularization methods LASSO and elastic nets by defining a single loss function, resulting in a unified set of coefficients across imputations. Our key contribution is to extend this principle to the framework of component-wise gradient boosting by proposing MIBoost, a novel algorithm that employs a uniform variable-selection mechanism across imputed datasets. Simulation studies suggest that our approach yields prediction performance comparable to that of these recently proposed methods.
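To make the idea of a uniform variable-selection mechanism concrete, the following is a minimal sketch of component-wise L2 boosting run jointly over several imputed datasets. It is an illustration under stated assumptions, not the algorithm as specified in the paper and not the interface of the booami package: the function name `uniform_selection_boost`, the parameters `n_iter` and `nu`, the use of simple linear base learners, the pooled sum of squared errors as selection criterion, and the crude noise-based imputation in the demo are all hypothetical choices made for this example. In each iteration, every candidate covariate is fitted to the current residuals in each imputed dataset, the covariate with the smallest loss summed over all datasets is selected, and only that covariate is updated in every dataset. The final predictor averages the dataset-specific coefficients.

```python
import numpy as np


def uniform_selection_boost(X_list, y, n_iter=100, nu=0.1):
    """Sketch: component-wise L2 boosting with one shared selection per step.

    X_list holds M imputed covariate matrices of shape (n, p); y is complete.
    All names and defaults are assumptions made for this illustration.
    """
    M, p = len(X_list), X_list[0].shape[1]
    means = [X.mean(axis=0) for X in X_list]
    Xc = [X - mu for X, mu in zip(X_list, means)]   # centered columns
    col_ss = [(X ** 2).sum(axis=0) for X in Xc]     # column sums of squares
    y_bar = y.mean()
    beta = np.zeros((M, p))                         # one coefficient row per imputation

    for _ in range(n_iter):
        slopes = np.zeros((M, p))
        pooled_sse = np.zeros(p)
        for m in range(M):
            resid = (y - y_bar) - Xc[m] @ beta[m]
            # Least-squares slope of the residuals on each single column.
            slopes[m] = Xc[m].T @ resid / col_ss[m]
            # SSE left after fitting each column alone: ||r||^2 - slope^2 * ||x||^2.
            pooled_sse += resid @ resid - slopes[m] ** 2 * col_ss[m]
        # Uniform selection: the same covariate is updated in every dataset.
        j = int(np.argmin(pooled_sse))
        beta[:, j] += nu * slopes[:, j]

    intercepts = np.array([y_bar - mu @ b for mu, b in zip(means, beta)])
    # Single predictor: average intercepts and coefficients across imputations.
    return intercepts.mean(), beta.mean(axis=0), beta


# Demo on synthetic data. The "imputation" below (observed mean plus noise)
# is only a stand-in for a proper multiple-imputation procedure.
rng = np.random.default_rng(0)
n, p, M = 200, 10, 5
X_full = rng.normal(size=(n, p))
y = 2.0 * X_full[:, 0] - 1.5 * X_full[:, 3] + rng.normal(size=n)
mask = rng.random((n, p)) < 0.2                     # about 20% missing values
X_obs = np.where(mask, np.nan, X_full)
col_mean, col_sd = np.nanmean(X_obs, axis=0), np.nanstd(X_obs, axis=0)
X_list = []
for _ in range(M):
    draws = col_mean + col_sd * rng.normal(size=(n, p))
    X_list.append(np.where(mask, draws, X_full))

intercept, coef, beta = uniform_selection_boost(X_list, y)
print(np.round(coef, 2))
```

Because the selected covariate is shared across datasets in every iteration, the set of nonzero coefficients is identical in all rows of `beta`, so averaging them cannot enlarge the selected set. This is the property that distinguishes a uniform selection mechanism from boosting each imputed dataset separately and averaging the resulting estimates.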

Paper Structure

This paper contains 15 sections, 16 equations, 1 table.