Learning Counterfactual Distributions via Kernel Nearest Neighbors

Kyuseong Choi, Jacob Feitelberg, Caleb Chin, Anish Agarwal, Raaz Dwivedi

TL;DR

The paper addresses learning counterfactual distributions for unit-outcome entries under missing-not-at-random (MNAR) missingness with limited samples by formulating a distributional matrix completion problem and introducing kernel-NN, a nearest-neighbors method operating on kernel mean embeddings. It combines a latent-factor model with distances based on the maximum mean discrepancy (MMD) to estimate full distributions, providing instance-dependent guarantees that hold under MNAR and non-positivity, including staggered adoption and propensity-based missingness. Theoretical results establish consistent distributional recovery and a guarantee for a distributional treatment effect ($\mathrm{iDTE}$) estimator, while experiments on simulated data and the HeartSteps mobile health study demonstrate accurate distributional imputation and favorable performance compared to scalar baselines. This approach offers a principled, scalable route for learning and comparing multivariate counterfactual distributions in causal panel data with complex missingness patterns.
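
For readers unfamiliar with the two ingredients named above, the standard textbook definitions may help. The notation here ($k$ for a kernel, $\mathcal{H}$ for its reproducing kernel Hilbert space, $\mu$ and $\nu$ for two distributions) is assumed for illustration and may differ from the paper's. The kernel mean embedding of $\mu$ is $\mathbb{E}_{X \sim \mu}[k(X, \cdot)] \in \mathcal{H}$, and the squared MMD is the squared RKHS distance between two such embeddings:

$$
\mathrm{MMD}_k^2(\mu, \nu)
= \bigl\| \mathbb{E}_{X \sim \mu}[k(X, \cdot)] - \mathbb{E}_{Y \sim \nu}[k(Y, \cdot)] \bigr\|_{\mathcal{H}}^2
= \mathbb{E}[k(X, X')] - 2\,\mathbb{E}[k(X, Y)] + \mathbb{E}[k(Y, Y')],
$$

where $X, X' \sim \mu$ and $Y, Y' \sim \nu$ are all independent. The second form involves only kernel evaluations, so it can be estimated directly from finite samples of each distribution.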

Abstract

Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user's weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of missing-not-at-random (MNAR) data, where observations are available only for certain unit-outcome combinations and the observation availability can be correlated with the properties of the distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introducing a kernel-based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. Furthermore, we demonstrate that our nearest neighbors approach is robust to heteroscedastic noise, provided we have access to two or more measurements for the observed unit-outcome entries, a robustness not present in prior works on nearest neighbors with single measurements.
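
To make the idea of nearest neighbors over distributions concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gaussian_gram`, `mmd2`, `kernel_nn_impute`), the Gaussian kernel, the fixed bandwidth, and the dictionary data layout are all hypothetical choices. It also uses the simple biased (V-statistic) MMD estimate, so it does not reproduce the noise robustness that the abstract attributes to having two or more measurements.

```python
import numpy as np


def gaussian_gram(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x (n, d) and y (m, d)."""
    sq = (np.sum(x**2, axis=1)[:, None]
          + np.sum(y**2, axis=1)[None, :]
          - 2.0 * x @ y.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (gaussian_gram(x, x, bandwidth).mean()
            - 2.0 * gaussian_gram(x, y, bandwidth).mean()
            + gaussian_gram(y, y, bandwidth).mean())


def kernel_nn_impute(samples, i, t, eta, bandwidth=1.0):
    """Estimate the distribution of entry (i, t) from neighboring units.

    samples: dict mapping (unit, outcome) -> (n, d) array, observed entries only.
    A unit j is a neighbor of i if it is observed at outcome t and its average
    squared MMD to unit i, over outcomes where both are observed, is <= eta.
    Returns (pooled_samples, neighbors); pooled_samples is None if no neighbor
    qualifies. Pooling equals a uniform average of the neighbors' empirical
    distributions only when all neighbors have the same number of samples.
    """
    units = sorted({u for (u, _) in samples})
    outcomes = sorted({s for (_, s) in samples})
    neighbors = []
    for j in units:
        if j == i or (j, t) not in samples:
            continue
        shared = [s for s in outcomes
                  if s != t and (i, s) in samples and (j, s) in samples]
        if not shared:
            continue
        dist = np.mean([mmd2(samples[(i, s)], samples[(j, s)], bandwidth)
                        for s in shared])
        if dist <= eta:
            neighbors.append(j)
    if not neighbors:
        return None, neighbors
    pooled = np.vstack([samples[(j, t)] for j in neighbors])
    return pooled, neighbors
```

The threshold `eta` plays the role of the neighborhood radius $\eta$ that appears in the estimator notation elsewhere on this page; a smaller value admits fewer but more similar neighbors. How the paper selects $\eta$ (the figure captions mention a cross-validated and a direct choice) is not modeled here.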

Paper Structure

This paper contains 45 sections, 9 theorems, 42 equations, 8 figures.

Key Result

Proposition 1

Suppose the observed measurements and missingness from the distributional matrix completion model satisfy the factorization, latent-independence, unobserved-confounding, and measurement-generation assumptions. Then for any values of $\eta, \delta > 0$, the estimator $\widehat{\mu}_{1, 1, \eta}$ of ...

Figures (8)

  • Figure 1: HeartSteps app users' per-hour step count distributions. Each panel shows step counts for participants in the HeartSteps study [klasnja2019efficacy] (see the application section for details). The left panel shows the per-hour step count distributions of two participants who received notifications, with step counts measured at different time points during the study. The right panel shows the observed per-hour step count distribution for one of the participants from the left panel, together with the counterfactual step count distribution estimated by $\textsc{kernel-NN}$ for the same participant. The dashed lines mark the averages of the histograms in the corresponding colors.
  • Figure 3: Missingness under staggered random adoption and MCAR. In panel (a), control units are colored blue until their adoption time, which respects the confounded staggered adoption assumption (see the simulation appendix for details). In panel (b), blue entries are observed completely at random with observation probability $p = 0.5$.
  • Figure 4: Comparing $\textsc{kernel-NN}$ and the empirical distribution of observed samples for simulated data. Each column compares how well the summary statistics of the empirical distribution $\mu_{1, T}^{(Z)}$ of observed samples and of the $\textsc{kernel-NN}$ output $\widehat{\mu}_{1, T, \widehat{\eta}_{\mathrm{cv}}}$ approximate those of the estimand $\mu_{1, T}$.
  • Figure 5: Squared $\mathop{\mathrm{MMD}}\nolimits$ error of cross-validated kernel-NN by dimension $d$ and missingness pattern. Panel (a) depicts the decay of the squared MMD error of $\textsc{kernel-NN}$ as $N$ increases for different measurement dimensions $d$ under staggered adoption missingness (see panel (a) of Figure 3 for the missingness pattern), and panel (b) depicts the analogous information under MCAR missingness (see panel (b) of Figure 3 for the missingness pattern).
  • Figure 6: Comparing two versions of $\textsc{kernel-NN}$ for simulated data. Under the staggered adoption setup with fixed measurement dimension $d = 4$, panel (a) depicts the squared $\mathop{\mathrm{MMD}}\nolimits$ error of $\widehat{\mu}_{i, t, \widehat{\eta}_{\mathrm{dir}}}$ (denoted Kernel-NN Direct) and $\widehat{\mu}_{i, t, \widehat{\eta}_{\mathrm{cv}}}$ (denoted Kernel-NN CV). Panel (b) depicts the training time (in seconds) needed to select $\widehat{\eta}_{\mathrm{dir}}$ and $\widehat{\eta}_{\mathrm{cv}}$.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Example 1: Location-scale family
  • Example 2: Infinite-dimensional family
  • Remark 1
  • Proposition 1: Instance dependent guarantee
  • Theorem 1: Staggered adoption guarantee
  • Corollary 1: Guarantees for specific examples under staggered adoption
  • Corollary 2: $\mathrm{\texttt{i}DTE}$ guarantee under staggered adoption
  • Theorem 2: Propensity-based guarantee
  • Corollary 3: Guarantees for specific examples under MCAR
  • Lemma 1: Recovering the model and algorithm of [li2019nearest]
  • ...and 2 more