Differentially Private Synthetic Data Generation for Relational Databases

Kaveh Alimohammadi, Hao Wang, Ojas Gulati, Akash Srivastava, Navid Azizan

TL;DR

The paper tackles differential privacy for synthetic relational data by avoiding master-table flattening and instead learning a bi-adjacency matrix that links individually DP-generated tables. It proves cross-table $k$-way marginals can be expressed as (fractional) linear functions of the bi-adjacency, enabling a scalable iterative optimization that targets worst-case marginal errors. The method combines a relaxed projected-gradient-descent solver with an unbiased recursive sampling scheme and a random-slicing scalability enhancement, all under DP budgets to guarantee privacy. Utility bounds are provided, along with convergence and runtime guarantees, and the approach is validated on real datasets (e.g., MovieLens and IPUMS) with open-source PyTorch implementations. Overall, the work advances practical, privacy-preserving synthetic relational data generation with strong statistical fidelity and referential integrity.

Abstract

Existing differentially private (DP) synthetic data generation mechanisms typically assume a single-source table. In practice, data is often distributed across multiple tables with relationships across tables. In this paper, we introduce the first-of-its-kind algorithm that can be combined with any existing DP mechanism to generate synthetic relational databases. Our algorithm iteratively refines the relationship between individual synthetic tables to minimize their approximation errors in terms of low-order marginal distributions while maintaining referential integrity. This algorithm eliminates the need to flatten a relational database into a master table (saving space), operates efficiently (saving time), and scales effectively to high-dimensional data. We provide both DP and theoretical utility guarantees for our algorithm. Through numerical experiments on real-world datasets, we demonstrate the effectiveness of our method in preserving fidelity to the original data.

Paper Structure

This paper contains 34 sections, 24 theorems, 92 equations, 7 figures, and 7 algorithms.

Key Result

Lemma 1

For a relational database $(\mathcal{D}_1,\mathcal{D}_2,\bm{B})$ and any $k$-way cross-table query $q_{(\mathcal{S}_1, \mathcal{S}_2), (\bm{y}_1, \bm{y}_2)}$, let $\bm{1}_{(\mathcal{S}_1, \bm{y}_1)} \in \{0,1\}^{n_1}$ denote an indicator vector whose $i$-th element equals $1$ iff the $i$-th record in
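
The lemma statement is cut off in this copy, so its exact formula is not reproduced here. As a hedged illustration of the TL;DR's claim that cross-table marginals are (fractional) linear in the bi-adjacency matrix, the Python sketch below assumes one natural reading: the query value is the fraction of linked record pairs whose two endpoints match $\bm{y}_1$ on $\mathcal{S}_1$ and $\bm{y}_2$ on $\mathcal{S}_2$, i.e. a bilinear form in the two indicator vectors divided by the total number of links. The helper names (`indicator`, `cross_table_marginal`) and the toy tables are hypothetical and do not come from the paper.

```python
def indicator(table, columns, values):
    """Return a 0/1 list: entry i is 1 iff record i equals `values` on `columns`."""
    return [int(all(row[c] == v for c, v in zip(columns, values))) for row in table]


def cross_table_marginal(table1, table2, B, S1, y1, S2, y2):
    """Assumed form: (u^T B v) / (1^T B 1), with u, v the indicator vectors."""
    u = indicator(table1, S1, y1)
    v = indicator(table2, S2, y2)
    matched = sum(u[i] * B[i][j] * v[j]
                  for i in range(len(u)) for j in range(len(v)))
    total = sum(map(sum, B))  # total number of links in the bi-adjacency matrix
    return matched / total if total else 0.0


# Toy data (hypothetical): 3 records in table 1, 2 records in table 2.
users = [{"age": "young"}, {"age": "old"}, {"age": "young"}]
movies = [{"genre": "drama"}, {"genre": "comedy"}]
B = [[1, 0],
     [1, 1],
     [0, 1]]  # B[i][j] = 1 iff record i of table 1 is linked to record j of table 2

q = cross_table_marginal(users, movies, B, ["age"], ["young"], ["genre"], ["comedy"])
print(q)
```

Under this reading, the numerator is linear in $\bm{B}$ for fixed tables and so is the denominator, which is what makes the query a ratio of two linear functions of the bi-adjacency matrix.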

Figures (7)

  • Figure 1: An illustration of the function $\mathsf{UBS}$ in Algorithm \ref{alg::rec_ubs}. Consider $\bm{x} = [0.1, 0.2, 0.5, 0.7, 0.6, 0.9]$ and $m = 3$. In the first merging step, we partition the indices of $\bm{x}$ into $L = 4$ groups: $\mathcal{G}[0] = \{1, 2, 3\}$, $\mathcal{G}[1] = \{4\}$, and so on. The corresponding group probabilities are $\mathsf{sum}(\bm{x}\mid \mathcal{G}[0]) = 0.8$, $\mathsf{sum}(\bm{x}\mid \mathcal{G}[1]) = 0.7$, etc. Next, we compute the complement of these probabilities and apply $\mathsf{UBS}([0.2, 0.3, 0.4, 0.1], 1)$ to select $L - m = 1$ group to exclude, which leads to the base case. We then sample from the probability vector $[0.2, 0.3, 0.4, 0.1]$; suppose we select group $\mathcal{G}[1]$, which is then excluded. After this, we sample exactly one index from $\mathcal{G}[0]$ according to the normalized probabilities $[0.1/0.8, 0.2/0.8, 0.5/0.8]$. Since $\mathcal{G}[2]$ and $\mathcal{G}[3]$ contain only one element each, those elements are included in the final output. The final output is the index set $\{3, 5, 6\}$.
  • Figure 2: Illustration of how the random slicing heuristic enhances the scalability of the main algorithm. A full version of the main algorithm, incorporating the random slicing heuristic, is provided in Appendix \ref{append::main_alg}.
  • Figure 3: We use AIM and MST to produce individual synthetic tables and then apply our algorithm with privacy budget $\epsilon_{\text{rel}}$ to establish their relationships. We illustrate the impact of varying $\epsilon_{\text{rel}}$ on the average $3$-way marginal query error for the MovieLens dataset (Left) and the IPUMS dataset (Right). As expected, the average error decreases as $\epsilon_{\text{rel}}$ increases, due to reduced noise.
  • Figure 4: We analyze the impact of varying the number of iterations $T$ in Algorithm \ref{alg::adapt_alg_l2} on the quality of synthetic data. The figures show a U-shaped curve between average error and $T$. This trend highlights a trade-off: increasing $T$ allows more workload queries to be selected, but their answers, computed from real data, become increasingly noisy.
  • Figure 5: We show the effect of the hyperparameter $\alpha \in [0,1]$ on the quality of synthetic data produced by our algorithm. This parameter allocates the privacy budget between the Gaussian and exponential mechanisms.
  • ...and 2 more figures
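
The walk-through in the Figure 1 caption can be sketched in Python as follows. This is a reconstruction from the caption only, not the authors' algorithm as published: the function name `ubs`, the greedy left-to-right merging rule, the tolerance `EPS`, and the use of 0-based indices are all assumptions. The sketch takes a vector $\bm{x}$ with entries in $[0,1]$ that sum to an integer $m$ and returns $m$ distinct indices, with index $i$ included with probability $x_i$.

```python
import random

EPS = 1e-12  # assumed tolerance for floating-point comparisons


def ubs(x, m, rng=random):
    """Sample m distinct indices so that index i is included with probability x[i].

    Assumes 0 <= x[i] <= 1 and sum(x) == m up to rounding (not checked here).
    """
    support = [i for i, p in enumerate(x) if p > EPS]  # zero entries are never picked
    if m <= 0 or not support:
        return []
    if m == 1:
        # Base case: the entries sum to 1, so draw a single index directly.
        return rng.choices(support, weights=[x[i] for i in support], k=1)

    # Merging step: greedily pack consecutive indices into groups with sum <= 1.
    groups, sums = [], []
    for i in support:
        if groups and sums[-1] + x[i] <= 1.0 + EPS:
            groups[-1].append(i)
            sums[-1] += x[i]
        else:
            groups.append([i])
            sums.append(x[i])

    # Recurse on the complements to choose which L - m groups to exclude.
    complements = [max(0.0, 1.0 - s) for s in sums]
    excluded = set(ubs(complements, len(groups) - m, rng))

    # Draw exactly one index from every group that was kept.
    chosen = []
    for g, members in enumerate(groups):
        if g not in excluded:
            weights = [x[i] for i in members]
            chosen.append(rng.choices(members, weights=weights, k=1)[0])
    return sorted(chosen)


x = [0.1, 0.2, 0.5, 0.7, 0.6, 0.9]
sample = ubs(x, 3, random.Random(0))
print(sample)  # three distinct 0-based indices
```

With 0-based indices, the caption's final output $\{3, 5, 6\}$ corresponds to `[2, 4, 5]`. A group with sum $s$ is kept with probability $s$, and an index inside it is then drawn with probability $x_i / s$, so each index is included with probability $x_i$ overall.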

Theorems & Definitions (31)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Example 1
  • Theorem 4
  • Theorem 5
  • Definition 3
  • ...and 21 more