Optimal Transport for Structure Learning Under Missing Data

Vy Vo, He Zhao, Trung Le, Edwin V. Bonilla, Dinh Phung

TL;DR

This work tackles causal discovery when data are missing, showing that naive imputation prior to structure learning is suboptimal. It introduces OTM, an Optimal Transport-based, score-based framework that jointly learns missing-data imputations and a causal graph by minimizing the Wasserstein distance between the observed-data distribution and the model distribution, using a learnable imputation and a push-forward to align completed samples with the SCM. The approach is agnostic to the base complete-data causal discovery method and accommodates nonlinear additive-noise models, demonstrating superior recovery of true graphs and better scalability in simulations and real biological datasets. The method provides a principled way to perform structure learning under MAR/MNAR settings and highlights identifiability considerations, with potential impact on robust causal inference in real-world messy data.

Abstract

Causal discovery in the presence of missing data introduces a chicken-and-egg dilemma. While the goal is to recover the true causal structure, robust imputation requires considering the dependencies or, preferably, causal relations among variables. Merely filling in missing values with existing imputation methods and subsequently applying structure learning on the complete data is empirically shown to be sub-optimal. To address this problem, we propose a score-based algorithm for learning causal structures from missing data based on optimal transport. This optimal transport viewpoint diverges from existing score-based approaches that are dominantly based on expectation maximization. We formulate structure learning as a density fitting problem, where the goal is to find the causal model that induces a distribution of minimum Wasserstein distance with the observed data distribution. Our framework is shown to recover the true causal graphs more effectively than competing methods in most simulations and real-data settings. Empirical evidence also shows the superior scalability of our approach, along with the flexibility to incorporate any off-the-shelf causal discovery methods for complete data.
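The density-fitting view in the abstract can be illustrated with a small sketch. It is not the authors' OTM implementation: it uses complete data only, skips imputation and graph learning, and simply scores two hand-written candidate models by their empirical Wasserstein distance to the data. The function names (`empirical_wasserstein2`, `sample_chain`, `sample_empty`) and the toy linear mechanism are assumptions made for illustration. For two uniform empirical measures with the same number of atoms, the optimal transport plan is a one-to-one matching, so the distance can be computed exactly with an assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def empirical_wasserstein2(a: np.ndarray, b: np.ndarray) -> float:
    """Exact 2-Wasserstein distance between two uniform empirical measures
    whose atoms are the rows of `a` and `b` (same number of rows)."""
    if a.shape != b.shape:
        raise ValueError("both samples must have the same shape")
    # cost[i, j] = squared Euclidean distance between a[i] and b[j]
    diff = a[:, None, :] - b[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)
    # With equal uniform weights, optimal transport reduces to an assignment.
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].mean()))


def sample_chain(rng: np.random.Generator, n: int) -> np.ndarray:
    """Hypothetical candidate with edge X1 -> X2: X2 = 2 * X1 + small noise."""
    x1 = rng.normal(size=n)
    x2 = 2.0 * x1 + 0.1 * rng.normal(size=n)
    return np.column_stack([x1, x2])


def sample_empty(rng: np.random.Generator, n: int) -> np.ndarray:
    """Hypothetical candidate with no edge: same marginals, independent X1, X2."""
    x1 = rng.normal(size=n)
    x2 = np.sqrt(4.01) * rng.normal(size=n)
    return np.column_stack([x1, x2])


rng = np.random.default_rng(0)
n = 200
data = sample_chain(rng, n)  # toy "observed" data generated with X1 -> X2

# Score each candidate by its distance to the data; smaller means a better fit.
d_true = empirical_wasserstein2(data, sample_chain(rng, n))
d_empty = empirical_wasserstein2(data, sample_empty(rng, n))
print(f"candidate with edge: {d_true:.3f}, candidate without edge: {d_empty:.3f}")
```

In this toy setting the candidate that shares the data's dependence structure sits much closer to the data than the one that only matches the marginals, which is the signal a distance-minimizing score can exploit.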

Paper Structure (36 sections, 2 theorems, 16 equations, 13 figures, 1 algorithm)

Key Result

Lemma 3.2

For $h_c, h_m$ defined as above, if $h_c$ is optimal in the sense that $h_c$ recovers the original data, i.e., $h_c({\bm{X}}^{j}_{{\textnormal{O}}}) = {\bm{X}}^{j}, \forall j \in [n]$, we have where $h_c\#\mu_{{\mathcal{D}}}({\bm{X}}_{{\textnormal{O}}}) = n^{-1} \sum^{n}_{j=1} \delta_{h_c({\bm{X}}^{j}_{{\textnormal{O}}})}$, which also represents the empirical distribution over the true data.
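The push-forward in the lemma, $h_c\#\mu_{{\mathcal{D}}}$, can be made concrete with a short sketch: pushing an empirical measure through a map just applies the map to every atom and keeps the weights at $1/n$. The example below is an illustration under strong assumptions, not the paper's construction. Missing entries are encoded as NaN, only the second variable is ever missing, and the toy data satisfy $X_2 = 2 X_1$ exactly, so a hand-written completion map `h_c` (a hypothetical name) can recover every original sample. In real data such an oracle map is not available.

```python
import numpy as np


def pushforward(h, atoms: np.ndarray) -> np.ndarray:
    """Atoms of h#mu for mu = n^{-1} * sum_j delta_{atoms[j]}.
    The weights are unchanged (1/n each), so only the atoms are returned."""
    return np.stack([h(x) for x in atoms])


def h_c(x_obs: np.ndarray) -> np.ndarray:
    """Hypothetical oracle completion for the toy mechanism X2 = 2 * X1.
    Assumes only the second coordinate can be missing (encoded as NaN)."""
    x = x_obs.copy()
    if np.isnan(x[1]):
        x[1] = 2.0 * x[0]
    return x


rng = np.random.default_rng(0)
n = 6

# Toy complete data with a deterministic relation between the two variables.
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2.0 * x1])

# Observed data: hide the second coordinate in every other row.
X_obs = X.copy()
X_obs[::2, 1] = np.nan

# Push the empirical measure over the observed data through h_c.
completed = pushforward(h_c, X_obs)

# Because h_c recovers each original sample, the pushed-forward atoms
# coincide with the atoms of the empirical distribution over the true data.
print(np.allclose(completed, X))
```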

Figures (13)

  • Figure 1: Visualization of the quality of imputation vs. causal discovery performance. Better imputation in terms of reconstruction quality does not always imply more accurate structure learning.
  • Figure 2: Visualization of the optimization process of OTM. ${\mathbf{X}}$ is the ground-truth complete data. $\widehat{{\mathbf{X}}} = f_{\theta}\left[\phi({\mathbf{X}}_{{\textnormal{O}}}) \right]$ is the estimated complete data from the model. Top: As training progresses, the model generates imputations that are closer to the original data both by Euclidean distance (value-wise) and Wasserstein distance (distribution-wise). Bottom: The quality of the estimated graph improves accordingly over training.
  • Figure 3: Nonlinear ANMs (MCAR). SK refers to batch Sinkhorn imputation and RR refers to round-robin Sinkhorn imputation. SHD $\downarrow$ and F1 $\uparrow$.
  • Figure 4: Real-world datasets (MCAR). SK refers to batch Sinkhorn imputation and RR refers to round-robin Sinkhorn imputation. SHD $\downarrow$ and F1 $\uparrow$.
  • Figure 5: Scalability of methods in nonlinear ANMs (MCAR) at 10% missing rate. SK refers to batch Sinkhorn imputation and RR refers to round-robin Sinkhorn imputation. SHD $\downarrow$ and F1 $\uparrow$. The training time of the imputation baselines includes the time for learning imputations.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Definition 3.1
  • Lemma 3.2
  • Theorem 3.3
  • proof
  • proof