Generative Modeling under Non-Monotonic MAR Missingness via Approximate Wasserstein Gradient Flows

Gitte Kremling, Jeffrey Näf, Johannes Lederer

Abstract

The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead, ad-hoc imputation methods are often used, for which it remains unclear whether the correct distribution can be recovered. In this paper, we propose FLOWGEM, a principled iterative method for generating a complete dataset from a dataset with values Missing at Random (MAR). Motivated by convergence results of the ignoring maximum likelihood estimator, our approach minimizes the expected Kullback-Leibler (KL) divergence between the observed data distribution and the distribution of the generated sample over different missingness patterns. To minimize the KL divergence, we employ a discretized particle evolution of the corresponding Wasserstein Gradient Flow, where the velocity field is approximated using a local linear estimator of the density ratio. This construction yields a data generation scheme that iteratively transports an initial particle ensemble toward the target distribution. Simulation studies and real-data benchmarks demonstrate that FLOWGEM achieves state-of-the-art performance across a range of settings, including the challenging case of non-monotonic MAR mechanisms. Together, these results position FLOWGEM as a principled and practical alternative to existing imputation methods, and a decisive step towards closing the gap between theoretical rigor and empirical performance.
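To make the particle scheme described in the abstract concrete, the following is a minimal, self-contained sketch of a discretized particle Wasserstein gradient flow for KL minimization. It substitutes a simple Gaussian-kernel plug-in estimate of the score difference for the paper's local linear density-ratio estimator; all function names, step sizes, and bandwidths are illustrative assumptions, not the authors' implementation.

import numpy as np

def kde_log_density_grad(x, sample, bw):
    """Gradient of the log of a Gaussian-kernel density estimate of `sample`, at points `x`."""
    diffs = x[:, None, :] - sample[None, :, :]            # (n, m, d) pairwise differences
    w = np.exp(-np.sum(diffs**2, axis=-1) / (2 * bw**2))  # (n, m) kernel weights
    w /= w.sum(axis=1, keepdims=True) + 1e-12             # normalize per evaluation point
    # grad log p_hat(x) = -(1 / bw^2) * sum_j w_j (x - x_j)
    return -np.einsum("nm,nmd->nd", w, diffs) / bw**2

def wgf_step(particles, target_sample, step=0.05, bw=0.5):
    """One explicit Euler step of the Wasserstein gradient flow of KL(q || p):
    velocity v = -grad log(q/p) = grad log p - grad log q."""
    grad_log_q = kde_log_density_grad(particles, particles, bw)      # current ensemble density
    grad_log_p = kde_log_density_grad(particles, target_sample, bw)  # target density
    return particles + step * (grad_log_p - grad_log_q)

rng = np.random.default_rng(0)
target = rng.normal(loc=2.0, scale=1.0, size=(1000, 2))  # stand-in for the "observed" data
particles = rng.normal(size=(500, 2))                    # initial particle ensemble
for _ in range(300):
    particles = wgf_step(particles, target)
print("particle mean:", particles.mean(axis=0))          # should drift toward roughly (2, 2)

Each step moves every particle along an estimate of $-\nabla \log r$, where $r$ is the density ratio between the current ensemble and the target, which is the generic transport mechanism the abstract describes.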

Paper Structure

This paper contains 22 sections, 4 theorems, 41 equations, 5 figures, 3 tables, and 1 algorithm.

Key Result

Proposition 2

Assume that MAR holds and $\mathbb{P}(M=0 \mid X=x) > 0$ for almost all $x$. Then $\mathbb{P}_X$ is the unique solution to
$$\min_{\mathbb{Q}} \; \mathbb{E}_{M}\!\left[\operatorname{KL}\!\big(\mathbb{P}_{X_{o(M)} \mid M} \,\big\|\, \mathbb{Q}_{o(M)}\big)\right],$$
the expected KL divergence, over missingness patterns $M$, between the observed-data distribution and the corresponding marginal of the generated distribution $\mathbb{Q}$.
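A hedged sketch of why this identification holds, under the notation assumed above (this compresses the roles of MAR and the positivity condition; it is not the paper's proof): writing $g(m \mid x_{o(m)})$ for the MAR mechanism and $q$, $p$ for the densities of $\mathbb{Q}$ and $\mathbb{P}_X$, the objective equals, up to a constant not depending on $\mathbb{Q}$, the negative population ignoring log-likelihood,
$$\mathbb{E}_{M}\!\left[\operatorname{KL}\!\big(\mathbb{P}_{X_{o(M)} \mid M} \,\big\|\, \mathbb{Q}_{o(M)}\big)\right] = C - \mathbb{E}\!\left[\log q\big(X_{o(M)}\big)\right].$$
Under MAR the observed-data density factorizes as $f(x_{o(m)}, m) = g(m \mid x_{o(m)})\, p(x_{o(m)})$, so by Gibbs' inequality the right-hand side is minimized exactly when $q(x_{o(m)}) = p(x_{o(m)})$ wherever $g(m \mid x_{o(m)}) > 0$. The condition $\mathbb{P}(M=0 \mid X=x) > 0$ applies this to the complete-case pattern $m=0$, forcing $q = p$ almost everywhere and hence $\mathbb{Q} = \mathbb{P}_X$.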

Figures (5)

  • Figure 1: Three data matrices with missing values, each with three different patterns. The matrix on the left shows monotone missingness, while the middle and right matrices show cases with non-monotone missingness. The middle matrix is the one used in the example of Section \ref{sec:emp_res}.
  • Figure 2: Standardized energy distance in log-scale (left) and quantile estimate (right) for the simulated example with uniform distribution and $n=2000$, repeated $B=20$ times. The solid blue line on the right indicates the true quantile value of the uncontaminated distribution, while the dashed line shows the population quantile when missing values are ignored. Our method, FLOWGEM, clearly outperforms its competitors, minimizing the energy distance and estimating the quantile most accurately. We note that MICE also performs well, but unlike FLOWGEM, it lacks theoretical justification. Some values are cut off for readability: the energy distance for MIRI ranges from $6$ to $4 \cdot 10^{148}$ with a median of $2 \cdot 10^{46}$, and its quantile estimates reach as low as $-2.95$.
  • Figure 3: Standardized energy distance in log-scale (left) and quantile estimate (right) for the simulated example with Gaussian distribution and $n=2000$, repeated $B=20$ times. The blue line on the right indicates the true quantile value of the uncontaminated distribution. Our method, FLOWGEM, is among the best-performing methods, minimizing the energy distance and estimating the quantile most accurately. We note that MICE and the Bayes method essentially use Gaussian regression and Gaussian approximations, respectively, giving them a somewhat unfair edge in this example. Despite this edge, increasing $T$ to $1500$ further improves FLOWGEM and closes the gap to MICE and Bayes; this result is not included here.
  • Figure 4: Scatter plots of the first two dimensions of the generated samples for each method, for a single replication of the simulation study in Section \ref{sec:toyexmpl} with $n=2000$, $d=3$, and uniform distribution. The black square indicates the support $[0,1]^2$ of the true distribution.
  • Figure 5: Scatter plots of the first two dimensions of the generated samples for each method, for a single replication of the simulation study in Section \ref{sec:toyexmpl} with $n=2000$, $d=3$, and Gaussian distribution. We note that MICE and the Bayes method essentially use Gaussian regression and Gaussian approximations, respectively, giving them a somewhat unfair edge in this example.

Theorems & Definitions (9)

  • Remark 1: Choice of $\psi$
  • Proposition 2: Population consistency of the KL minimizer
  • Lemma 3: First variation of the objective (see the sketch after this list)
  • Lemma 4: Variational form for $h \circ r$
  • Proposition 5: Velocity field approximation error
  • Proof of Proposition \ref{prop:KLmin}
  • Proof of Lemma \ref{lem:deriv}
  • Proof of Lemma \ref{lem:key_obs_m}
  • Proof of Proposition \ref{prop:approx_error}
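Lemma 3 and Lemma 4 are listed above only by title. As a hedged illustration of the standard computation such results typically rest on (not the paper's statements), consider the functional $F(\mu) = \operatorname{KL}(\mu \,\|\, \pi)$ with densities $q = d\mu/dx$ and $p = d\pi/dx$. Its first variation is
$$\frac{\delta F}{\delta \mu}(x) = \log \frac{q(x)}{p(x)} + 1 = \log r(x) + 1,$$
where $r = q/p$ is the density ratio, and the associated Wasserstein gradient flow transports mass along the velocity field
$$v(x) = -\nabla \frac{\delta F}{\delta \mu}(x) = -\nabla \log r(x).$$
Any estimator of $r$ (the local linear estimator mentioned in the abstract being one choice) therefore yields a computable particle update $x \leftarrow x + \eta\, v(x)$, which is precisely the discretized evolution sketched after the abstract.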