kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa

TL;DR

This work addresses the challenge of imputing missing values by preserving the full conditional distribution P(y|x) rather than only its mean. It introduces kNNSampler, a simple stochastic imputation that samples from the empirical distribution of the k nearest neighbors, effectively performing a nearest-neighbor random hot deck. The authors provide a solid theoretical foundation using RKHS embeddings, proving consistency and convergence rates that scale with the intrinsic dimension of the covariate distribution, and they validate the method on synthetic and real data with uncertainty quantification and support for multiple imputation. Empirically, kNNSampler recovers missing-value distributions more faithfully than several baselines, and its uncertainty estimates converge to nominal coverage, suggesting practical utility for downstream analyses that require well-calibrated imputations.

Abstract

We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the $k$ units most similar to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available (https://github.com/SAP/knn-sampler).
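The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the code from the linked repository: the function names (`knn_indices`, `knn_sampler_impute`, `knn_mean_impute`), the Euclidean distance, and the bimodal toy data are all assumptions chosen for illustration. It contrasts drawing one neighbour's response at random with averaging the neighbours' responses.

```python
import math
import random


def knn_indices(x_query, X_obs, k):
    """Indices of the k observed covariates closest to x_query (Euclidean distance, an assumption)."""
    order = sorted(range(len(X_obs)), key=lambda i: math.dist(x_query, X_obs[i]))
    return order[:k]


def knn_sampler_impute(X_obs, y_obs, X_miss, k, rng):
    """For each unit with a missing response, draw one observed response
    uniformly at random from the unit's k nearest neighbours."""
    return [y_obs[rng.choice(knn_indices(x, X_obs, k))] for x in X_miss]


def knn_mean_impute(X_obs, y_obs, X_miss, k):
    """Mean-of-neighbours imputation, shown only for contrast."""
    imputed = []
    for x in X_miss:
        idx = knn_indices(x, X_obs, k)
        imputed.append(sum(y_obs[i] for i in idx) / k)
    return imputed


# Hypothetical toy data: the response is bimodal (near -1 or +1) for every x.
rng = random.Random(0)
X_obs = [(rng.random(),) for _ in range(2000)]
y_obs = [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.1) for _ in X_obs]
X_miss = [(rng.random(),) for _ in range(200)]

sampled = knn_sampler_impute(X_obs, y_obs, X_miss, k=50, rng=rng)
averaged = knn_mean_impute(X_obs, y_obs, X_miss, k=50)

mean_abs_sampled = sum(abs(v) for v in sampled) / len(sampled)
mean_abs_averaged = sum(abs(v) for v in averaged) / len(averaged)
print(mean_abs_sampled, mean_abs_averaged)
```

On data like this, the sampled imputations stay near the two modes while the averaged ones collapse toward zero, a value the response almost never takes. Calling `knn_sampler_impute` several times with the same inputs yields several completed datasets, which is one way the multiple-imputation use mentioned in the abstract could be set up.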

Paper Structure

This paper contains 36 sections, 4 theorems, 71 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $(x_1, y_1), \dots, (x_n, y_n) \stackrel{i.i.d.}{\sim} P(y|x)P(x)$ and let $\hat{P}(y|x)$ be the kNN conditional distribution (eq:cond-dist-est) with $k$ nearest neighbours. Suppose that Assumptions as:lipschitz, as:bounded, as:doubling-dimensions and as:VC-dimensions hold. Let $0 < \delta < 1$. Then the following holds simultaneously for all $x \in \mathcal{X}$, $k \in \{1, \dots, n\}$ and $0 < r < r_{\rm max}$
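
For orientation, the estimator named in the theorem can be written out from the abstract's description: drawing uniformly from the responses of the $k$ nearest neighbours is the same as sampling from their empirical distribution. The notation below is an assumption made here for illustration (the symbols $\mathrm{NN}_k(x)$ and $\delta_{y_i}$ do not appear in the text above, and the paper's own equation may be written differently):

$$
\hat{P}(y \mid x) \;=\; \frac{1}{k} \sum_{i \in \mathrm{NN}_k(x)} \delta_{y_i},
\qquad
\int y \, d\hat{P}(y \mid x) \;=\; \frac{1}{k} \sum_{i \in \mathrm{NN}_k(x)} y_i ,
$$

where $\mathrm{NN}_k(x)$ denotes the index set of the $k$ observed covariates closest to $x$ and $\delta_{y_i}$ is the point mass at $y_i$. The second expression is the neighbour average, which is the single number a kNN mean imputer would report, whereas a sampling-based imputer draws from the whole of $\hat{P}(y \mid x)$.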

Figures (6)

  • Figure 1: Comparison of imputations by kNNImputer (left) and kNNSampler (right). In each figure, $x$ and $y$ are the covariate and response, respectively. Blue points are observed covariate-response pairs, green points are true missing values and red points are imputed values. For details, see Section \ref{sec:simulation}.
  • Figure 2: Comparison of the samples of the true conditional distribution $P(y|x)$ of missing response $y$ of a unit with covariate $x = 0.5$ (blue) and the kNN conditional distribution $\hat{P}(y|x)$ with $k = 1,000$ (orange) on the noisy ring data in Figure \ref{fig:ring-demo-intro} with sample size $10,000$. The imputations by kNNImputer with $k = 5$ are shown as the green dotted vertical line.
  • Figure 3: Missing value imputations by different methods for a dataset from the linear chi-square model (\ref{eq:synthetic-data-linear}) with sample size $N = 10,000$ with $30\%$ missing rate under the MAR mechanism. True missing responses are shown in green, imputations in red, and the rest in blue.
  • Figure 4: Missing value imputations by different methods for a dataset from the noisy ring model (\ref{eq:synthetic-data-ring-data}) with sample size $N = 10,000$ with $30\%$ missing rate under the MAR mechanism. True missing responses are shown in green, imputations in red, and the rest in blue.
  • Figure 5: Coverage probabilities of kNN prediction intervals at different missing rates (MR) for different sample sizes. The mean and standard deviation over 10 independent runs are shown for each setting. The top three figures are on the noisy ring data, and the bottom three are on the linear chi-square data.
  • ...and 1 more figure
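
The coverage check described in the Figure 5 caption can be sketched as follows. The captions do not say how the kNN prediction intervals are built, so this sketch assumes one plausible construction: the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles of the $k$ nearest neighbours' responses. The function names, the quantile rule, and the toy data are all hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def knn_interval(x_query, X_obs, y_obs, k, alpha):
    """Prediction interval from order statistics of the k nearest neighbours'
    responses (an assumed construction, not confirmed by the source)."""
    order = sorted(range(len(X_obs)), key=lambda i: math.dist(x_query, X_obs[i]))
    ys = sorted(y_obs[i] for i in order[:k])
    lower = ys[math.floor((alpha / 2) * (k - 1))]
    upper = ys[math.ceil((1 - alpha / 2) * (k - 1))]
    return lower, upper


def empirical_coverage(intervals, y_true):
    """Fraction of true (held-out) responses that fall inside their intervals."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, y_true))
    return hits / len(y_true)


# Hypothetical toy data: y = x + Gaussian noise; the "missing" responses are
# simulated so that the true values are known and coverage can be measured.
rng = random.Random(1)
X_obs = [(rng.random(),) for _ in range(3000)]
y_obs = [x[0] + rng.gauss(0.0, 0.2) for x in X_obs]
X_miss = [(rng.random(),) for _ in range(300)]
y_miss_true = [x[0] + rng.gauss(0.0, 0.2) for x in X_miss]

alpha = 0.1  # nominal coverage 1 - alpha = 0.9
intervals = [knn_interval(x, X_obs, y_obs, k=100, alpha=alpha) for x in X_miss]
cov = empirical_coverage(intervals, y_miss_true)
print(cov)
```

Repeating this over several independent runs, sample sizes and missing rates, and reporting the mean and standard deviation of `cov`, would give the kind of summary the caption describes.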

Theorems & Definitions (9)

  • Remark 1
  • Remark 2
  • Theorem 1
  • proof
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof