kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa
TL;DR
This work addresses missing-value imputation that preserves the full conditional distribution P(y|x) of a missing response rather than only its conditional mean. It introduces kNNSampler, a simple stochastic imputation method that samples from the empirical distribution of the responses of the k nearest neighbors, which amounts to a nearest-neighbor random hot deck. The authors analyze the method using RKHS embeddings, proving consistency and convergence rates that depend on the intrinsic dimension of the covariate distribution. They evaluate it on synthetic and real data, including uncertainty quantification and multiple imputation. Empirically, kNNSampler recovers missing-value distributions more faithfully than several baselines, and the coverage of its uncertainty estimates approaches the nominal level, suggesting practical utility for downstream analyses that require well-calibrated imputations.
Abstract
We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the $k$ units most similar to it in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available (https://github.com/SAP/knn-sampler).
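The sampling step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation from the linked repository: the function name `knn_sampler_impute`, its parameters, the Euclidean distance, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def knn_sampler_impute(X_obs, y_obs, X_mis, k=5, n_draws=1, rng=None):
    """Hypothetical sketch: impute by sampling responses of the k nearest observed units.

    X_obs: (n_obs, d) covariates of units with observed responses
    y_obs: (n_obs,)   observed responses
    X_mis: (n_mis, d) covariates of units with missing responses
    Returns an (n_mis, n_draws) array of sampled imputations.
    """
    rng = np.random.default_rng(rng)
    X_obs = np.asarray(X_obs, dtype=float)
    X_mis = np.asarray(X_mis, dtype=float)
    y_obs = np.asarray(y_obs)
    if not 1 <= k <= X_obs.shape[0]:
        raise ValueError("k must be between 1 and the number of observed units")

    # Squared Euclidean distances between missing and observed units (assumed metric).
    d2 = ((X_mis[:, None, :] - X_obs[None, :, :]) ** 2).sum(axis=-1)

    # Indices of the k nearest observed units for each missing unit (unordered).
    nn_idx = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]

    # For each draw, pick one of the k neighbors uniformly at random.
    pick = rng.integers(0, k, size=(X_mis.shape[0], n_draws))
    chosen = np.take_along_axis(nn_idx, pick, axis=1)
    return y_obs[chosen]


# Synthetic example (assumed data): noise level grows with the covariate.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(500, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.1 + 0.5 * x[:, 0])
missing = rng.random(500) < 0.3

# Each column is one stochastic imputation, usable for multiple imputation.
draws = knn_sampler_impute(x[~missing], y[~missing], x[missing], k=10, n_draws=20, rng=1)

# A rough empirical interval per missing unit, taken from the sampled values.
lower, upper = np.quantile(draws, [0.025, 0.975], axis=1)
```

Averaging the same $k$ neighbor responses instead of sampling one of them would give a conditional-mean imputation of the kind the abstract attributes to kNNImputer, which removes the spread that the sampled values retain.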
