missForestPredict -- Missing data imputation for prediction settings

Elena Albu, Shan Gao, Laure Wynants, Ben Van Calster

TL;DR

The paper addresses missing data in prediction settings by introducing missForestPredict, an adaptation of the missForest imputation algorithm that can impute new observations at prediction time. It imputes variables iteratively with random forests, stops on a unified NMSE-based convergence criterion computed from the out-of-bag error, and saves the imputation model for each variable and iteration for later use. The authors compare missForestPredict with mean/mode imputation, linear regression imputation, bagging, mice, miceRanger, IterativeImputer, and kNN on simulated and real datasets with various missingness mechanisms, reporting competitive performance and favorable speed and memory profiles, especially on large data. The paper gives practical guidance on when RF-based imputers excel and when simpler methods suffice, and it highlights max.depth=10 as a way to reduce computation without sacrificing accuracy. Overall, the work provides a scalable tool for prediction-time imputation and a detailed benchmark for selecting imputation strategies in real-world predictive tasks.

Abstract

Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occurs at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion (unified for continuous and categorical variables and based on the out-of-bag error) is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
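
The fit, store, and replay pattern described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: a least-squares model stands in for the random forest, an in-sample NMSE on observed values stands in for the out-of-bag error, the stopping rule (stop once the mean NMSE no longer decreases and discard that iteration) is assumed, and the names `fit_imputer` and `impute_new` are hypothetical. Only continuous variables are handled.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized MSE: squared error relative to the spread of y_true around its mean."""
    return float(np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2))

def fit_linear(X, y):
    """Stand-in learner (NOT a random forest): least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def fit_imputer(X, max_iter=10):
    """Impute NaNs in X iteratively; return the imputed data and the saved state."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    init = np.nanmean(X, axis=0)              # initialization: column means
    X = np.where(miss, init, X)
    cols = np.flatnonzero(miss.any(axis=0))   # variables that need a model
    models, prev_err = [], np.inf
    for _ in range(max_iter if len(cols) else 0):
        X_try, iter_models, errs = X.copy(), {}, []
        for j in cols:
            others = [k for k in range(X.shape[1]) if k != j]
            obs = ~miss[:, j]
            coef = fit_linear(X_try[obs][:, others], X_try[obs, j])
            fitted = predict_linear(coef, X_try[obs][:, others])
            errs.append(nmse(X_try[obs, j], fitted))   # assumed proxy for the OOB error
            X_try[~obs, j] = predict_linear(coef, X_try[~obs][:, others])
            iter_models[int(j)] = coef
        err = float(np.mean(errs))
        if err >= prev_err:                   # assumed rule: no improvement, stop
            break
        X, prev_err = X_try, err
        models.append(iter_models)            # keep one model per variable and iteration
    return X, {"init": init, "models": models}

def impute_new(X_new, state):
    """Replay the saved models, in the same order, on new observations."""
    X_new = np.array(X_new, dtype=float)
    miss = np.isnan(X_new)
    X_new = np.where(miss, state["init"], X_new)
    for iter_models in state["models"]:
        for j, coef in iter_models.items():
            rows = miss[:, j]
            if rows.any():
                others = [k for k in range(X_new.shape[1]) if k != j]
                X_new[rows, j] = predict_linear(coef, X_new[rows][:, others])
    return X_new
```

A single new row such as `impute_new([[1.0, 2.0, float("nan")]], state)` is imputed without refitting anything, which is the property that matters at prediction time. In this sketch, a variable that had no missing values during fitting has no saved model, so a missing value in it stays at the stored column mean.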

Paper Structure (44 sections, 21 figures, 15 tables)

This paper contains 44 sections, 21 figures, 15 tables.

Figures (21)

  • Figure 1: Imputed variable and predictor matrix
  • Figure 2: NMSE errors (deviations from true values) on test sets for simulated datasets with simulated missingness: low correlation (0.1) and low AUROC (0.75). To facilitate visualisation, only four out of the twelve noise variables are included in the figure.
  • Figure 3: Prediction performance (BSS) for simulated datasets with simulated missingness: low correlation (0.1) and low AUROC (0.75)
  • Figure 4: NMSE errors (deviations from true values) on test sets for simulated datasets with simulated missingness: low correlation (0.1) and high AUROC (0.9). To facilitate visualisation, only four out of the twelve noise variables are included in the figure.
  • Figure 5: Prediction performance (BSS) for simulated datasets with simulated missingness: low correlation (0.1) and high AUROC (0.9)
  • ...and 16 more figures
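
The two metrics named in the captions above can be written out as follows. This is a sketch under assumptions: NMSE is taken as the squared deviation of imputed from true values divided by the squared deviation of the true values from their mean, and BSS is the standard Brier skill score with the outcome prevalence as the reference prediction. The paper's exact definitions may differ in detail, and the function names are illustrative.

```python
def nmse(true_values, imputed_values):
    """Normalized MSE of imputed values against the known true values.

    0 means perfect recovery; 1 is what imputing the mean of the true values gives.
    """
    mean = sum(true_values) / len(true_values)
    num = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values))
    den = sum((t - mean) ** 2 for t in true_values)
    return num / den

def brier_skill_score(outcomes, probabilities):
    """Brier skill score: 1 - BS / BS_ref, with BS_ref from predicting the prevalence."""
    n = len(outcomes)
    prevalence = sum(outcomes) / n
    bs = sum((p - y) ** 2 for y, p in zip(outcomes, probabilities)) / n
    bs_ref = sum((prevalence - y) ** 2 for y in outcomes) / n
    return 1 - bs / bs_ref
```

Under these definitions, lower NMSE is better and higher BSS is better. A BSS of 0 means the prediction model does no better than always predicting the prevalence.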