Differentially Private Synthetic Data Generation for Relational Databases
Kaveh Alimohammadi, Hao Wang, Ojas Gulati, Akash Srivastava, Navid Azizan
TL;DR
The paper tackles differentially private synthetic data generation for relational databases. Rather than flattening the database into a master table, it learns a bi-adjacency matrix that links individually DP-generated tables. It proves that cross-table $k$-way marginals can be expressed as (fractional) linear functions of the bi-adjacency matrix, enabling a scalable iterative optimization that targets worst-case marginal errors. The method combines a relaxed projected-gradient-descent solver with an unbiased recursive sampling scheme and a random-slicing scalability enhancement, all operating within a DP budget. Utility bounds are provided, along with convergence and runtime guarantees, and the approach is validated on real datasets (e.g., MovieLens and IPUMS) with open-source PyTorch implementations. Overall, the work advances practical, privacy-preserving synthetic relational data generation with strong statistical fidelity and referential integrity.
Abstract
Existing differentially private (DP) synthetic data generation mechanisms typically assume a single-source table. In practice, data is often distributed across multiple tables that are related to one another. In this paper, we introduce a first-of-its-kind algorithm that can be combined with any existing DP mechanism to generate synthetic relational databases. Our algorithm iteratively refines the relationship between individual synthetic tables to minimize their approximation errors in terms of low-order marginal distributions while maintaining referential integrity. This algorithm eliminates the need to flatten a relational database into a master table (saving space), operates efficiently (saving time), and scales effectively to high-dimensional data. We provide both DP and theoretical utility guarantees for our algorithm. Through numerical experiments on real-world datasets, we demonstrate the effectiveness of our method in preserving fidelity to the original data.
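To make the idea of a relationship between two tables concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy parent table and child table with one categorical attribute each and a 0/1 bi-adjacency matrix B whose entry B[i, j] = 1 means parent row i is linked to child row j. The function name cross_table_marginal and all data values are hypothetical. The sketch shows that the unnormalized cross-table 2-way marginal is linear in B, and that the normalized marginal is a ratio of two linear functions of B.

```python
import numpy as np

def cross_table_marginal(x, y, B, n_x, n_y):
    """Cross-table 2-way marginal over linked (parent, child) row pairs.

    x: parent attribute codes, shape (n1,)
    y: child attribute codes, shape (n2,)
    B: bi-adjacency matrix, shape (n1, n2); B[i, j] = 1 if row i links to row j
    """
    X = np.eye(n_x)[x]          # one-hot encoding of parent rows, (n1, n_x)
    Y = np.eye(n_y)[y]          # one-hot encoding of child rows, (n2, n_y)
    counts = X.T @ B @ Y        # linked-pair counts per (a, b); linear in B
    return counts / B.sum()     # normalize by number of links (also linear in B)

# Hypothetical toy data: 3 parent rows, 4 child rows, 4 links.
x = np.array([0, 1, 1])
y = np.array([0, 0, 1, 1])
B = np.array([[1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)

marginal = cross_table_marginal(x, y, B, n_x=2, n_y=2)
print(marginal)
```

In this toy setup each child row is linked to exactly one parent row (every column of B sums to 1), which is one simple way to express referential integrity for a one-to-many relationship. Because the numerator and denominator are both linear in B, the marginal errors can be optimized over a relaxed, real-valued B with gradient-based methods.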
