Missing At Random as Covariate Shift: Correcting Bias in Iterative Imputation
Luke Shannon, Song Liu, Katarzyna Reluga
TL;DR
This work addresses bias in iterative data imputation caused by the covariate shift that a missing-at-random (MAR) mechanism induces between observed and missing values. It reframes imputation as a risk minimisation problem and derives principled importance weights to align the training distribution with the unobserved target distribution. A bias-aware, weighted iterative imputation algorithm jointly estimates weights and conditional imputation models, using a density-ratio-based approach within a round-robin framework. Across eight diverse datasets, the proposed method reduces RMSE by up to 7% and Wasserstein distance by up to 20% relative to unweighted baselines, demonstrating practical improvements for downstream tasks while highlighting the importance of accounting for MAR in imputation.
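To make the round-robin idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes linear (ridge) conditional models and a logistic-regression classifier as the density-ratio estimator; the function names (`density_ratio_weights`, `weighted_ridge`, `weighted_iterative_impute`) and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def fit_logistic(X, y, n_iter=200, lr=0.1, l2=1e-3):
    """Gradient-descent logistic regression; returns coefficients incl. intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        beta -= lr * (Xb.T @ (p - y) / len(y) + l2 * beta)
    return beta

def density_ratio_weights(X_obs, X_mis):
    """Estimate w(x) = p(x | missing) / p(x | observed) at the rows of X_obs.

    Assumption: the ratio is obtained from a classifier that separates the two
    groups (class odds times the inverse prior odds)."""
    X = np.vstack([X_obs, X_mis])
    y = np.concatenate([np.zeros(len(X_obs)), np.ones(len(X_mis))])
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
    beta = fit_logistic((X - mu) / sd, y)
    z = np.column_stack([np.ones(len(X_obs)), (X_obs - mu) / sd]) @ beta
    w = np.exp(np.clip(z, -10, 10)) * len(X_obs) / len(X_mis)
    return w / w.mean()  # normalise so the weights average to one

def weighted_ridge(X, y, w, alpha=1e-3):
    """Solve the importance-weighted ridge regression in closed form."""
    Xb = np.column_stack([np.ones(len(X)), X])
    A = Xb.T @ (w[:, None] * Xb) + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (w * y))

def weighted_iterative_impute(X, n_rounds=5):
    """Round-robin imputation with weights re-estimated for every column."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # mean initialisation
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            mis = mask[:, j]
            if not mis.any() or mis.all():
                continue  # nothing to impute, or nothing to train on
            others = np.delete(filled, j, axis=1)
            # Weights depend on the current imputations of the other columns,
            # so weights and models are updated together across rounds.
            w = density_ratio_weights(others[~mis], others[mis])
            coef = weighted_ridge(others[~mis], filled[~mis, j], w)
            design = np.column_stack([np.ones(mis.sum()), others[mis]])
            filled[mis, j] = design @ coef
    return filled
```

Calling `weighted_iterative_impute(X)` on an array with `np.nan` entries returns a completed copy in which observed entries are left unchanged. Setting all weights to one recovers an ordinary unweighted round-robin imputer, which is the kind of baseline the reported comparison is made against.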
Abstract
Accurate imputation of missing data is critical to downstream machine learning performance. We formulate missing data imputation as a risk minimisation problem, which highlights a covariate shift between the observed and unobserved data distributions. Popular imputation methods do not account for the bias induced by this covariate shift, which leads to suboptimal performance. In this paper, we derive theoretically valid importance weights that correct for the induced distributional bias. Furthermore, we propose a novel imputation algorithm that jointly estimates the importance weights and the imputation models, enabling bias correction throughout the imputation process. Empirical results across benchmark datasets show reductions in root mean squared error and Wasserstein distance of up to 7% and 20%, respectively, compared to otherwise identical unweighted methods.
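As a rough illustration of how a risk minimisation view exposes the covariate shift, consider the standard importance-weighting identity below. The notation is an assumption introduced here, not taken from the paper: $x_j$ is the column being imputed, $x_{-j}$ the remaining columns, $m_j \in \{0, 1\}$ the missingness indicator ($1$ = missing), $f$ the imputation model and $\ell$ a loss. The sketch further assumes $m_j$ is independent of $x_j$ given $x_{-j}$ and that the support of $p(x_{-j} \mid m_j = 1)$ is contained in that of $p(x_{-j} \mid m_j = 0)$; the weights derived in the paper may take a different form.

```latex
\begin{align*}
R(f) &= \mathbb{E}\!\left[\ell\big(f(x_{-j}),\, x_j\big) \,\middle|\, m_j = 1\right] \\
     &= \mathbb{E}\!\left[w(x_{-j})\,\ell\big(f(x_{-j}),\, x_j\big) \,\middle|\, m_j = 0\right], \\
w(x_{-j}) &= \frac{p(x_{-j} \mid m_j = 1)}{p(x_{-j} \mid m_j = 0)}
           = \frac{P(m_j = 1 \mid x_{-j})}{P(m_j = 0 \mid x_{-j})}
             \cdot \frac{P(m_j = 0)}{P(m_j = 1)}.
\end{align*}
```

The first line is the risk on the rows that actually need imputing; the second rewrites it as a weighted risk on the rows where $x_j$ is observed, which is the only place a model can be trained. An unweighted fit corresponds to setting $w \equiv 1$, which is only correct when the two covariate distributions coincide.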
