What Is a Good Imputation Under MAR Missingness?
Jeffrey Näf, Erwan Scornet, Julie Josse
TL;DR
This work analyzes missing data imputation under MAR from a distributional perspective, showing that per-variable imputation within FCS can identify correct conditionals even in nonparametric settings, provided overlap holds. It introduces three guiding properties for ideal imputers (distributional regression capability, nonlinearity capture, and robustness to covariate shifts) and presents mice-DRF, a distributional random forest-based imputation method meeting two of these criteria. The authors advocate energy distance as a principled evaluation metric for distributional fidelity, and demonstrate through simulations and the Air Quality dataset that RMSE often misorders methods, while distributional scores better reflect downstream performance. The findings stress that strong distributional shifts within MAR complicate imputation and that progress hinges on methods capable of drawing from the full conditional distributions under shifting covariate supports. Overall, the paper provides a principled framework for evaluating and developing MAR imputation methods beyond traditional RMSE-focused benchmarks. Key contributions include: (i) formal MAR definitions and identifiability results for sequential, per-variable imputation; (ii) a concrete set of criteria for ideal imputation methods and the introduction of mice-DRF; (iii) an energy-distance-based evaluation framework for imputation quality; and (iv) empirical evidence that distributional imputers outperform marginal predictors in MAR settings with covariate shifts, while highlighting remaining challenges when shifts are strong.
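To make the contrast between RMSE and a distributional score concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes a hypothetical two-variable Gaussian setup with a logistic MAR mechanism, two oracle imputers (conditional mean versus a draw from the true conditional), and a plain V-statistic estimate of the energy distance $2\,E\|X-Y\| - E\|X-X'\| - E\|Y-Y'\|$. The names `energy_distance`, `rmse`, and the data-generating choices are illustrative assumptions; the paper's actual score and experiments may be defined differently.

```python
import numpy as np


def energy_distance(x, y):
    """V-statistic estimate of 2 E||X-Y|| - E||X-X'|| - E||Y-Y'||.

    x and y are samples of shape (n, d) and (m, d).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def mean_dist(a, b):
        # Mean Euclidean distance over all pairs (a_i, b_j).
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()

    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)


def rmse(imputed, truth, miss):
    """Root mean squared error on the imputed entries of column 1."""
    return float(np.sqrt(np.mean((imputed[miss, 1] - truth[miss, 1]) ** 2)))


rng = np.random.default_rng(0)
n = 500

# Hypothetical data: X2 | X1 ~ N(X1, 1).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
truth = np.column_stack([x1, x2])

# Assumed MAR mechanism: X2 is more likely to be missing when the
# always-observed X1 is large.
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x1))

# Oracle imputer A: plug in the true conditional mean E[X2 | X1] = X1.
imp_mean = truth.copy()
imp_mean[miss, 1] = x1[miss]

# Oracle imputer B: draw from the true conditional N(X1, 1).
imp_draw = truth.copy()
imp_draw[miss, 1] = x1[miss] + rng.normal(size=int(miss.sum()))

rmse_mean, rmse_draw = rmse(imp_mean, truth, miss), rmse(imp_draw, truth, miss)
ed_mean, ed_draw = energy_distance(imp_mean, truth), energy_distance(imp_draw, truth)

print(f"RMSE            mean: {rmse_mean:.3f}   draw: {rmse_draw:.3f}")
print(f"energy distance mean: {ed_mean:.4f}  draw: {ed_draw:.4f}")
```

In this toy setting RMSE favors the conditional-mean imputer, because the mean minimizes squared error, whereas the energy distance favors the imputer that draws from the conditional, because only that one preserves the joint distribution. Comparing against the full data is possible here only because the data are simulated; with real missing values a score has to be computed from the observed data alone.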
Abstract
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first investigate whether the widely used fully conditional specification (FCS) approach indeed identifies the correct conditional distributions. Based on this analysis, we propose three essential properties an ideal imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We also discuss ways to compare imputation methods, based on distributional distances. Finally, numerical experiments illustrate the points made in this discussion.
