Masking criteria for selecting an imputation model
Yanjiao Yang, Daniel Suen, Yen-Chi Chen
TL;DR
This work examines how to select imputation models under missing data using masking criteria. It first analyzes the conventional mask-one-out (MOO) procedure, showing that its optimal target is the conditional distribution $p(x_j|x_r,r\oplus e_j)$ and that it may ignore stochasticity, which motivates MOORT (rank transformation) and MOOEN (energy distance) as distributional alternatives. It then introduces a likelihood-based framework (MOO likelihood) for learning imputation models from data, establishes identifiability, asymptotic normality, and consistency of BIC-based model selection, and connects masking to MAR/MCAR in monotone missing data. Across simulations and real data, MOORT and MOOEN provide robust, distributionally faithful measures of imputation utility and offer practical tools for comparing and learning imputation models. The work also introduces a practical visualization, the Prediction-Imputation diagram, for balancing predictive accuracy against imputation fidelity in applied settings.
Abstract
The masking-one-out (MOO) procedure, which masks an observed entry and compares it with its imputed value, is widely used for comparing imputation models. We study the optimum of this procedure, generalize it to a missing data assumption, and establish the corresponding semi-parametric efficiency theory. However, MOO measures prediction accuracy, which is not ideal for evaluating an imputation model. To address this issue, we introduce three modified MOO criteria, based on rank transformation, energy distance, and the likelihood principle, that allow us to select an imputation model that properly accounts for the stochastic nature of data. The likelihood approach further enables an elegant framework for learning an imputation model from the data, and we derive its statistical and computational learning theories as well as the consistency of BIC model selection. We also show how MOO is related to the missing-at-random assumption. Finally, we introduce the prediction-imputation diagram, a two-dimensional diagram that visually compares the prediction and imputation utilities of various imputation models.
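As a rough illustration of the basic MOO idea described in the abstract, the sketch below masks each observed entry in turn, imputes it, and averages the squared error. It is a minimal sketch under stated assumptions, not the authors' implementation: the column-mean imputer, the squared-error loss, and the names `mean_imputer` and `moo_loss` are all hypothetical choices made for this example.

```python
import numpy as np

def mean_imputer(x_masked):
    """Hypothetical imputer (an assumption for this sketch):
    fill each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(x_masked, axis=0)
    filled = x_masked.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled

def moo_loss(x, imputer):
    """Mask one observed entry at a time, impute it, and return the
    average squared difference between imputed and true values."""
    observed = np.argwhere(~np.isnan(x))
    errors = []
    for i, j in observed:
        x_masked = x.copy()
        x_masked[i, j] = np.nan              # mask a single observed entry
        x_imputed = imputer(x_masked)        # impute with the candidate model
        errors.append((x_imputed[i, j] - x[i, j]) ** 2)
    return float(np.mean(errors))

# Toy data with one genuinely missing entry (NaN).
x = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])
loss = moo_loss(x, mean_imputer)
print(loss)
```

A deterministic imputer such as this one is scored purely on how well it predicts the masked value, which is the sense in which the abstract describes MOO as a measure of prediction accuracy rather than of how well the imputation reproduces the stochastic nature of the data.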
