Optimal Transport with Heterogeneously Missing Data

Linus Bleistein, Aurélien Bellet, Julie Josse

TL;DR

This paper develops principled methods for optimal transport when data are heterogeneously missing under the missing-completely-at-random (MCAR) assumption. It provides a debiasing framework for the Bures-Wasserstein (BW) distance in Gaussian and linear-Monge settings, and introduces a matrix-completion estimator for entropic OT based on iterative singular value thresholding (ISVT), together with a cross-validation-free, BW-based criterion for hyperparameter selection. The results include dimension-free convergence guarantees, domain-adaptation bounds under missing data, and extensive experiments on synthetic and real datasets that demonstrate improved OT estimation and robustness under both MCAR and missing-not-at-random (MNAR) mechanisms. Overall, the work enables reliable distributional comparisons and transport-based analyses when data are incomplete, with practical implications for domain adaptation and statistical testing under missingness.

Abstract

We consider the problem of solving the optimal transport problem between two empirical distributions with missing values. Our main assumption is that the data is missing completely at random (MCAR), but we allow for heterogeneous missingness probabilities across features and across the two distributions. As a first contribution, we show that the Wasserstein distance between empirical Gaussian distributions and linear Monge maps between arbitrary distributions can be debiased without significantly affecting the sample complexity. Secondly, we show that entropic regularized optimal transport can be estimated efficiently and consistently using iterative singular value thresholding (ISVT). We propose a validation set-free hyperparameter selection strategy for ISVT that leverages our estimator of the Bures-Wasserstein distance, which could be of independent interest in general matrix completion problems. Finally, we validate our findings on a wide range of numerical applications.
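
To make the first contribution concrete, the following is a minimal sketch of how a Bures-Wasserstein distance can be computed from zero-imputed data after an inverse-probability correction of the first and second moments. It is an illustration under stated assumptions, not the paper's exact estimator: the helper names (`debiased_moments`, `psd_sqrt`, `bures_wasserstein_sq`) are hypothetical, the vector `p` is assumed to hold known per-feature observation probabilities, and missing entries are assumed to be encoded as NaN. The distance uses the standard closed form $\mathrm{BW}^2 = \|m_1 - m_2\|^2 + \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}\big((\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big)$.

```python
import numpy as np


def debiased_moments(X, p):
    """Mean and covariance from data with NaN-coded missing entries.

    X : (n, d) array, NaN marks a missing value.
    p : (d,) array of per-feature observation probabilities (assumed known).
    """
    p = np.asarray(p, dtype=float)
    mask = ~np.isnan(X)
    Z = np.where(mask, X, 0.0)            # zero imputation
    n = X.shape[0]

    # Under MCAR, E[Z_j] = p_j * mu_j, so rescale by 1 / p_j.
    mean = Z.mean(axis=0) / p

    # E[Z_i Z_j] = p_i p_j E[x_i x_j] for i != j, and p_i E[x_i^2] for i == j.
    W = np.outer(p, p)
    np.fill_diagonal(W, p)
    second_moment = (Z.T @ Z / n) / W

    cov = second_moment - np.outer(mean, mean)
    return mean, cov


def psd_sqrt(A):
    """Symmetric square root; negative eigenvalues are clipped to zero."""
    w, V = np.linalg.eigh((A + A.T) / 2.0)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def bures_wasserstein_sq(m1, S1, m2, S2):
    """Squared Bures-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    R = psd_sqrt(S1)
    cross = psd_sqrt(R @ S2 @ R)
    return float(
        np.sum((m1 - m2) ** 2)
        + np.trace(S1) + np.trace(S2)
        - 2.0 * np.trace(cross)
    )
```

Calling `debiased_moments` once per dataset, with its own probability vector, and passing the two results to `bures_wasserstein_sq` handles missingness rates that differ across features and across the two distributions. The corrected covariance is not guaranteed to be positive semidefinite in small samples, which is why `psd_sqrt` clips negative eigenvalues.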

Paper Structure

This paper contains 71 sections, 17 theorems, 185 equations, 10 figures, 1 algorithm.

Key Result

Proposition 4.1

Let $\overline{\mathbf{M}} := \mathbf{M} - \mathbf{P}\mathbf{M}\mathbf{Q} = [(1-p_iq_j)m_{ij}]_{i,j=1}^{d}$. The transport map $\overline{\mathbf{\Pi}}_\mathbf{M}$ solves the optimal transport problem

Figures (10)

  • Figure 1: Illustration of the effect of missing data on the optimal transport matching in a toy example. Left panels: data without missing values (top) and with missing values (bottom); the downsized points have at least one missing coordinate. Central panels: missing data is imputed by $0$ (not shown in this figure), introducing bias in the optimal transport plan. Right panels: optimal coupling matrices.
  • Figure 2: Value of the entropic regularized optimal transport problem as a function of $\Lambda \geq 0$ when $90\%$ of observations are missing.
  • Figure 3: Convergence of our estimator compared to the biased estimator. We sample two fixed datasets with given covariance and mean. For every sample size, we average the estimated distance over $50$ missingness masks. Left: convergence of the Bures-Wasserstein distance estimator. Center: separate convergence for all three terms.
  • Figure 4: Convergence of our estimator over different uniform missingness probabilities. For every value of $p$, we set $\mathbf{p} = (p,\dots, p)$ and $\mathbf{q} = (p,\dots,p)$. Lighter colors correspond to higher $p$ and hence to fewer missing values.
  • Figure 5: Robustness to MNAR data of our estimator (top) vs. the biased estimator (bottom). The colors of the heatmap indicate the error in Bures-Wasserstein estimation.
  • ...and 5 more figures

Theorems & Definitions (34)

  • Definition 3.1
  • Proposition 4.1
  • Proposition 4.2
  • Lemma 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 6.1
  • Definition A.1: sub-Gaussian Random Variable
  • Definition A.2
  • Lemma A.3
  • ...and 24 more