Causal-Aware Generative Adversarial Networks with Reinforcement Learning
Tu Anh Hoang Nguyen, Dang Nguyen, Tri-Nhan Vo, Thuc Duy Le, Sunil Gupta
TL;DR
CA-GAN addresses synthetic tabular data generation with explicit causal preservation. It first extracts a causal graph $\mathcal{G}_{real}$ via PC-based causal discovery and then trains a graph-conditioned WGAN-GP with $M$ sub-generators. A reinforcement-learning-based causal loss uses a reward based on the Structural Hamming Distance (SHD) to align the causal structure of the synthetic data with that of the real data, and it is optimized with a policy gradient that leverages the log-likelihoods of mixed-type outputs. Across 14 datasets, CA-GAN achieves superior causal preservation, downstream utility, and privacy protection compared to six baselines, with competitive computation times. The approach enables practical, privacy-preserving synthetic data generation that maintains reliable causal inferences for enterprise analytics and secure research.
Abstract
The utility of tabular data for tasks ranging from model training to large-scale data analysis is often constrained by privacy concerns or regulatory hurdles. While existing data generation methods, particularly those based on Generative Adversarial Networks (GANs), have shown promise, they frequently struggle with capturing complex causal relationships, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment. We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world tabular datasets. CA-GAN uses a two-step approach: causal graph extraction, which learns a robust, comprehensive set of causal relationships over the data manifold, followed by a custom conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) whose generation strictly follows the node structure of the causal graph. More importantly, the generator is trained with a new reinforcement-learning-based objective that aligns the causal graphs constructed from real and fake data, ensuring causal awareness in both the training and sampling phases. We demonstrate CA-GAN's superiority over six state-of-the-art (SOTA) methods across 14 tabular datasets. Our evaluations focus on three core data engineering metrics: causal preservation, utility preservation, and privacy preservation. Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets to benchmark database systems, accelerate software development, and facilitate secure data-driven research.
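To make the idea of aligning the real and synthetic causal graphs concrete, the sketch below computes a Structural Hamming Distance (SHD) between two directed graphs given as adjacency matrices and turns it into a scalar reward. It is a minimal illustration, not the authors' implementation: the function names `structural_hamming_distance` and `shd_reward` are hypothetical, the convention that a reversed edge counts as a single error is one common choice, and using the negated SHD as the reward is an assumption made here for illustration.

```python
import numpy as np


def structural_hamming_distance(adj_real, adj_fake):
    """Count node pairs whose edge status differs between two directed graphs.

    adj[i, j] != 0 means there is an edge i -> j. A missing edge, an extra
    edge, and a reversed edge each count as one error (assumed convention).
    """
    a = np.asarray(adj_real, dtype=bool)
    b = np.asarray(adj_fake, dtype=bool)
    if a.ndim != 2 or a.shape[0] != a.shape[1] or a.shape != b.shape:
        raise ValueError("adjacency matrices must be square and equally sized")

    # Entry (i, j) is True when the directed edge i -> j differs.
    diff = a != b
    # A node pair is wrong if either direction differs; this makes a
    # reversal (two differing entries) count once rather than twice.
    pair_diff = diff | diff.T
    # Each unordered pair appears once in the strict upper triangle.
    return int(np.triu(pair_diff, k=1).sum())


def shd_reward(adj_real, adj_fake):
    """Hypothetical reward: fewer structural errors give a higher reward."""
    return -float(structural_hamming_distance(adj_real, adj_fake))


# Real graph: 0 -> 1, 1 -> 2
g_real = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 0]]
# Synthetic graph: 1 -> 0 (reversed), 1 -> 2 (kept), 0 -> 2 (extra)
g_fake = [[0, 0, 1],
          [1, 0, 1],
          [0, 0, 0]]

shd = structural_hamming_distance(g_real, g_fake)
reward = shd_reward(g_real, g_fake)
print(shd, reward)
```

In this toy case the synthetic graph has one reversed edge and one extra edge, so the distance is 2. Because graph discovery followed by SHD is not differentiable with respect to generator parameters, a score of this kind is naturally used as a reward signal in a policy-gradient update rather than backpropagated directly.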
