missForestPredict -- Missing data imputation for prediction settings
Elena Albu, Shan Gao, Laure Wynants, Ben Van Calster
TL;DR
The paper addresses missing data in prediction settings by introducing missForestPredict, an adaptation of the missForest imputation algorithm that can also impute new observations at prediction time. The algorithm imputes iteratively with random forests, uses a single convergence criterion for continuous and categorical variables based on the out-of-bag normalized mean squared error (NMSE), and saves the imputation model of each variable at each iteration for later use. The authors compare missForestPredict with mean/mode imputation, linear regression imputation, bagging, mice, miceRanger, IterativeImputer, and k-nearest neighbours (kNN) on simulated and real datasets with several missingness mechanisms, and report competitive performance with favorable speed and memory profiles, especially on large data. The comparison gives practical guidance on when random-forest-based imputers excel and when simpler methods suffice, and the paper reports that setting max.depth=10 reduces computation without sacrificing accuracy. Overall, the work provides a scalable tool for prediction-time imputation and a detailed benchmark for choosing imputation strategies in predictive tasks.
Abstract
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
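The fit-then-replay idea described above (impute iteratively, stop on an error criterion, save one model per variable and iteration, then apply the saved models to a new observation) can be illustrated with a minimal sketch. This is not the missForestPredict implementation or its API. To stay dependency-free it assumes continuous variables only, swaps the random forest for a k-nearest-neighbours learner, and swaps the out-of-bag error for a leave-one-out error. All function names (`fit_imputer`, `impute_new`, `knn_predict`, `nmse`, `others`) and the toy data are hypothetical.

```python
import math
import random

def others(row, v):
    """All entries of a row except variable v (the predictors for v)."""
    return row[:v] + row[v + 1:]

def knn_predict(model, x, k, skip=None):
    """Stand-in learner (NOT a random forest): mean target of the k nearest stored rows."""
    X, y = model
    idx = sorted((j for j in range(len(X)) if j != skip),
                 key=lambda j: math.dist(X[j], x))[:k]
    return sum(y[j] for j in idx) / len(idx)

def nmse(model, k):
    """Leave-one-out MSE (stand-in for out-of-bag MSE) over variance of observed targets."""
    X, y = model
    mse = sum((knn_predict(model, X[j], k, skip=j) - y[j]) ** 2
              for j in range(len(y))) / len(y)
    mean_y = sum(y) / len(y)
    var = sum((t - mean_y) ** 2 for t in y) / len(y)
    return mse / var if var > 0 else 0.0

def fit_imputer(data, k=3, max_iter=10):
    """Impute training rows (None = missing); return imputed rows and the saved state."""
    p = len(data[0])
    means = []
    for v in range(p):
        seen = [r[v] for r in data if r[v] is not None]
        means.append(sum(seen) / len(seen))
    # Initialization: fill every gap with the training mean of its variable.
    X = [[means[v] if r[v] is None else r[v] for v in range(p)] for r in data]
    to_impute = [v for v in range(p) if any(r[v] is None for r in data)]
    models, best = [], math.inf
    for _ in range(max_iter):
        cand = [row[:] for row in X]           # work on a copy so we can roll back
        iter_models, errors = {}, []
        for v in to_impute:
            obs = [i for i, r in enumerate(data) if r[v] is not None]
            # Fit on rows where v is observed, using the current values of the others.
            model = ([others(cand[i], v) for i in obs], [data[i][v] for i in obs])
            errors.append(nmse(model, k))
            for i, r in enumerate(data):
                if r[v] is None:
                    cand[i][v] = knn_predict(model, others(cand[i], v), k)
            iter_models[v] = model
        err = sum(errors) / max(len(errors), 1)
        if err >= best:                        # no improvement: discard this iteration
            break
        X, best = cand, err
        models.append(iter_models)             # keep the models of accepted iterations
    return X, {"means": means, "models": models, "k": k}

def impute_new(row, state):
    """Replay the saved initialization and models on one new observation."""
    missing = {v for v, val in enumerate(row) if val is None}
    x = [state["means"][v] if val is None else val for v, val in enumerate(row)]
    for iter_models in state["models"]:        # same iteration order as in training
        for v, model in iter_models.items():   # same variable order as in training
            if v in missing:
                x[v] = knn_predict(model, others(x, v), state["k"])
    return x

# Toy data (hypothetical): variable 1 is about 2*x0, variable 2 is about -x0.
random.seed(0)
train = []
for _ in range(40):
    x0 = random.random()
    train.append([x0, 2 * x0 + random.gauss(0, 0.05), -x0 + random.gauss(0, 0.05)])
for i in range(0, 40, 5):
    train[i][1] = None                         # knock out some values of variable 1
for i in range(2, 40, 7):
    train[i][2] = None                         # and some values of variable 2

imputed, state = fit_imputer(train)
new_row = impute_new([0.5, None, None], state)
```

In this sketch a new observation never triggers refitting: it is initialized with the training means and passed through the stored models in the order they were fitted. A variable that was complete in the training data has no stored model here, so a gap in it at prediction time would stay at the training mean.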
