
DAGAF: A directed acyclic generative adversarial framework for joint structure learning and tabular data synthesis

Hristo Petkov, Calum MacLellan, Feng Dong

Abstract

Understanding the causal relationships between data variables can provide crucial insights into the construction of tabular datasets. Most existing causality learning methods typically focus on applying a single identifiable causal model, such as the Additive Noise Model (ANM) or the Linear non-Gaussian Acyclic Model (LiNGAM), to discover the dependencies exhibited in observational data. We improve on this approach by introducing a novel dual-step framework capable of performing both causal structure learning and tabular data synthesis under multiple causal model assumptions. Our approach uses Directed Acyclic Graphs (DAGs) to represent causal relationships among data variables. By applying various functional causal models, including ANM, LiNGAM, and the Post-Nonlinear model (PNL), we implicitly learn the contents of the DAG to simulate the generative process of observational data, effectively replicating the real data distribution. This is supported by a theoretical analysis explaining the multiple loss terms comprising the framework's objective function. Experimental results demonstrate that DAGAF outperforms many existing methods in structure learning, achieving significantly lower Structural Hamming Distance (SHD) scores across both real-world and benchmark datasets (Sachs: 47%, Child: 11%, Hailfinder: 5%, Pathfinder: 7% improvement compared to state-of-the-art), while being able to produce diverse, high-quality samples.
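The three functional causal models named in the abstract differ only in the mechanism that maps a node's parents (plus noise) to the node's value. A minimal sketch, assuming a toy 3-variable chain X1 → X2 (not the authors' code; the mechanisms `f` and the noise scales are illustrative choices):

```python
import numpy as np

# Hypothetical illustration of the three causal model classes over the
# single edge X1 -> X2. In each case the child is produced from the parent
# and an independent noise term, but the functional form differs.
rng = np.random.default_rng(0)
n = 1000

def anm(parent, noise):
    # Additive Noise Model: child = f(parent) + noise, f nonlinear,
    # noise may be Gaussian.
    return np.tanh(parent) + noise

def lingam(parent, noise):
    # LiNGAM: child is a LINEAR function of the parent, but the noise
    # must be non-Gaussian (here: uniform) for identifiability.
    return 0.8 * parent + noise

def pnl(parent, noise):
    # Post-Nonlinear model: child = g(f(parent) + noise) with g invertible;
    # here g = exp, so g^{-1} = log recovers the inner ANM term.
    return np.exp(np.tanh(parent) + noise)

x1 = rng.normal(size=n)
x2_anm = anm(x1, rng.normal(scale=0.1, size=n))
x2_lingam = lingam(x1, rng.uniform(-0.1, 0.1, size=n))
x2_pnl = pnl(x1, rng.normal(scale=0.1, size=n))
```

The PNL case is why the framework needs an inversion function g⁻¹ (Figure 2b): applying `np.log` to `x2_pnl` reduces it to an ANM residual that can be tested for independence from the parent.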

Paper Structure

This paper contains 37 sections, 13 theorems, 42 equations, 9 figures, 10 tables, 1 algorithm.

Key Result

Proposition 0

Let the ground-truth DAG $\mathcal{G}_\mathcal{A}$ be uniquely identifiable from $P(\mathbf{X})$. Then, under the causal identifiability assumption, minimizing the adversarial loss ensures that the implicitly generated distribution $P_{G_A}(\tilde{\mathbf{X}})$ aligns with $P(\mathbf{X})$.
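The proposition rests on the standard GAN argument: at the optimal discriminator, the generator's adversarial loss equals $2\,\mathrm{JSD}(P \,\|\, P_{G_A}) - \log 4$, which is minimized exactly when $P_{G_A} = P$. A minimal numerical sketch of that divergence term, assuming a crude histogram-based JSD estimator (illustrative, not the paper's implementation):

```python
import numpy as np

def jsd_estimate(p_samples, q_samples, bins=50):
    # Crude histogram-based Jensen-Shannon divergence between two
    # one-dimensional sample sets, over a shared binning range.
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p / p.sum() + 1e-12  # normalize; epsilon avoids log(0)
    q = q / q.sum() + 1e-12
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, 20000)      # stand-in for P(X)
matched = rng.normal(0.0, 1.0, 20000)   # generator matching P(X)
shifted = rng.normal(2.0, 1.0, 20000)   # generator off by a mean shift
```

When the generated distribution matches the real one, `jsd_estimate(real, matched)` is close to zero, while `jsd_estimate(real, shifted)` is clearly positive, mirroring why driving the adversarial loss to its minimum forces $P_{G_A}(\tilde{\mathbf{X}})$ onto $P(\mathbf{X})$.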

Figures (9)

  • Figure 1: Pipeline of the DAGAF algorithm
  • Figure 2: A Visual Representation of DAGAF. (a) The optimization structure under ANM and LiNGAM, where input data is processed to reconstruct $\tilde{\mathbf{X}}$ using multiple loss terms, excluding $L_{\text{KLD}}$ in the LiNGAM case. (b) The extended framework integrating ANM, LiNGAM, and PNL, where an additional inversion function $g^{-1}$ is introduced to compute $L_{\text{PNL}}$, unifying the optimization process. (c) The synthetic data generation process, illustrating how the framework enables structured data synthesis while preserving underlying causal relationships.
  • Figure 3: Comparison of the correlation matrices for real (left) and synthetic (right) features reveals that the statistical correlations across the feature space for both real and synthetic data are nearly identical, in both the ANM (first row) and the PNL (second row) case.
  • Figure 4: Principal Component Analysis (PCA) between the original and synthetic samples for both the ANM (left) and the PNL (right) case. We observe that both the input and the synthetic samples have similar clusters and outliers. The results indicate that the implicitly generated distribution resembles the original distribution in both mean and standard deviation, making them indistinguishable from each other.
  • Figure 5: Feature importance comparison between real (left) and synthetic (right) data, in both the ANM (first row) and the PNL (second row) case. The synthetic features with their relevance are indistinguishable from the original ones, allowing for their application in regression tasks.
  • ...and 4 more figures

Theorems & Definitions (27)

  • Proposition 0
  • Proof
  • Proposition 0
  • Proof
  • Proposition 0
  • Proof
  • Proposition 0
  • Proof
  • Remark 1
  • Proposition 0
  • ...and 17 more