missForestPredict -- Missing data imputation for prediction settings

Elena Albu; Shan Gao; Laure Wynants; Ben Van Calster

TL;DR

The paper addresses missing data in prediction settings by introducing missForestPredict, a prediction-focused adaptation of the missForest imputation algorithm that can impute new observations at prediction time. The algorithm imputes variables iteratively with random forests, uses a single NMSE-based convergence criterion computed from the out-of-bag error, and saves the imputation model for each variable and iteration for later use. The authors compare missForestPredict with mean/mode imputation, linear regression imputation, bagging, mice, miceRanger, IterativeImputer, and kNN on simulated and real datasets under several missingness mechanisms, and report competitive performance with favorable speed and memory use, especially on large data. The comparison gives practical guidance on when random-forest-based imputers excel and when simpler methods suffice, and the paper highlights setting max.depth=10 as a way to reduce computation without sacrificing accuracy. Overall, the work provides a scalable tool for prediction-time imputation and a detailed benchmark for choosing an imputation strategy in predictive tasks.

Abstract

Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
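The save-and-replay idea described in the abstract can be illustrated with a minimal sketch. It is not the package's implementation or API: a least-squares line stands in for the random forest, an in-sample NMSE on observed values stands in for the out-of-bag error, the data have only two continuous variables, and the stopping rule is simplified. The names `fit_line`, `nmse`, `fit_imputer` and `impute_new` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line y = a + b*x (assumed stand-in for a random forest)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - slope * mx, slope

def nmse(ys, preds):
    """Squared error normalized by the spread of ys (assumes ys is not constant)."""
    my = mean(ys)
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / sum((y - my) ** 2 for y in ys)

def fit_imputer(rows, max_iter=10):
    """rows: list of [x0, x1] with None marking a missing value."""
    # Initialization: fill each variable with the mean of its observed values.
    init = [mean(r[j] for r in rows if r[j] is not None) for j in (0, 1)]
    data = [[init[j] if r[j] is None else r[j] for j in (0, 1)] for r in rows]
    models, best_err = [], float("inf")
    for _ in range(max_iter):
        trial, step, errs = [r[:] for r in data], {}, []
        for j in (0, 1):  # impute variables one after the other
            k = 1 - j
            obs = [i for i, r in enumerate(rows) if r[j] is not None]
            a, b = fit_line([trial[i][k] for i in obs], [trial[i][j] for i in obs])
            # In-sample error on observed values (assumed stand-in for OOB error).
            errs.append(nmse([trial[i][j] for i in obs],
                             [a + b * trial[i][k] for i in obs]))
            for i, r in enumerate(rows):
                if r[j] is None:
                    trial[i][j] = a + b * trial[i][k]
            step[j] = (a, b)
        err = mean(errs)
        if err >= best_err:  # simplified rule: stop once the error no longer improves
            break
        data, best_err = trial, err
        models.append(step)  # keep one model per variable for this iteration
    return init, models, data

def impute_new(row, init, models):
    """Replay the stored initialization and per-iteration models on one new row."""
    out = [init[j] if row[j] is None else row[j] for j in (0, 1)]
    for step in models:
        for j in (0, 1):
            if row[j] is None:
                a, b = step[j]
                out[j] = a + b * out[1 - j]
    return out

# Toy data following x1 = 2 * x0 + 1, with two values of x0 missing.
train = [[1.0, 3.0], [2.0, 5.0], [None, 7.0], [4.0, 9.0], [None, 11.0], [6.0, 13.0]]
init, models, imputed = fit_imputer(train)
new_rows = [impute_new(r, init, models) for r in ([None, 7.0], [2.0, None])]
```

In this sketch a model is stored for both variables, so a new row can be imputed whichever of the two values is missing. New rows are imputed without refitting anything, using only what was stored during development.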

Paper Structure (44 sections, 21 figures, 15 tables)


Introduction
Methods
missForestPredict Algorithm
Comparison to alternative imputation methods
Overview of methodology
Datasets
Imputation methods
Amputation methods
Variable transformations
Prediction models
Comparison procedure
Considerations for failure situations
Code availability
Results
Results on simulated datasets with simulated missingness (amputation)
...and 29 more sections

Figures (21)

Figure 1: Imputed variable and predictor matrix
Figure 2: NMSE errors (deviations from true values) on test sets for simulated datasets with simulated missingness: low correlation (0.1) and low AUROC (0.75). To facilitate visualisation, only four out of the twelve noise variables are included in the figure.
Figure 3: Prediction performance (BSS) for simulated datasets with simulated missingness: low correlation (0.1) and low AUROC (0.75)
Figure 4: NMSE errors (deviations from true values) on test sets for simulated datasets with simulated missingness: low correlation (0.1) and high AUROC (0.9). To facilitate visualisation, only four out of the twelve noise variables are included in the figure.
Figure 5: Prediction performance (BSS) for simulated datasets with simulated missingness: low correlation (0.1) and high AUROC (0.9)
...and 16 more figures
