Imputation Uncertainty in Interpretable Machine Learning Methods
Pegah Golchian, Marvin N. Wright
TL;DR
This work addresses how missing data and imputation uncertainty influence interpretable machine learning explanations: partial dependence (PD), permutation feature importance (PFI), and SHAP. It extends the learner-Ψ framework to include imputation uncertainty and evaluates confidence interval (CI) coverage, width, and bias under MCAR, MAR, and MNAR missingness, using both single and multiple imputation across linear and non-linear data-generating processes. The main findings show that single imputation underestimates variance and can mislead interpretation, while multiple imputation substantially improves CI coverage at the cost of wider intervals. The better-performing method (MICE PMM vs. MICE RF) depends on the data-generating process. A real-data example confirms that ignoring imputation uncertainty can drastically alter inferred feature importance, and the authors recommend multiple imputation when explaining models trained on data with missing values.
Abstract
In real data, missing values occur frequently, which affects interpretation with interpretable machine learning (IML) methods. Recent work considers bias and shows that model explanations may differ between imputation methods, but it ignores the additional uncertainty introduced by imputation and its influence on variance and confidence intervals. We therefore compare the effects of different imputation methods on the confidence interval coverage probabilities of three IML methods: permutation feature importance, partial dependence plots, and Shapley values. We show that single imputation leads to underestimation of variance and that, in most cases, only multiple imputation achieves close to nominal coverage.
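To make the variance argument concrete, the following is a minimal sketch of how estimates from several imputed datasets are commonly pooled with Rubin's rules. This summary does not state the paper's exact pooling procedure, so Rubin's rules are an assumption here, and the function name `pool_rubin` and all numbers are hypothetical illustrations rather than results from the paper.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one IML quantity (e.g. a PFI value) across m imputed datasets.

    estimates: point estimate of the quantity from each imputed dataset
    variances: estimated variance of that quantity within each dataset
    Assumption: Rubin's rules are used for pooling (not confirmed by the source).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    pooled = statistics.fmean(estimates)      # pooled point estimate
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between    # total variance under Rubin's rules
    return pooled, within, between, total


# Hypothetical values for a single feature's importance from m = 3 imputations.
estimates = [0.40, 0.50, 0.60]
variances = [0.010, 0.010, 0.010]
pooled, within, between, total = pool_rubin(estimates, variances)
print(f"pooled={pooled:.3f} within={within:.4f} between={between:.4f} total={total:.4f}")
```

Single imputation only ever sees the within-imputation term, so any spread between imputations is silently dropped. In this toy example the total variance is more than twice the within-imputation variance, which is the kind of gap that would produce confidence intervals that are too narrow and coverage below the nominal level.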
