Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation

Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock

TL;DR

This work tackles DP-SDG under a vertical public-private data split, where public and private features coexist within the same dataset. It proposes a vertical framework that adapts the horizontal public-assisted DP-SDG methods GEM-Pub and PMW into vGEM^Pub and vPMW, adapts JAM-PGM into vJAM-PGM, and introduces a conditional-generation alternative that uses Private-PGM to sample private features conditioned on the public columns. In experiments on the Adult and Census datasets, the vertical pretraining methods generally underperform fully private AIM, whereas conditional generation, especially Conditional AIM, delivers the best utility, sometimes achieving zero error on public marginals and lowering private marginal error. The findings suggest that, in practical vertical deployments, conditional generation is the most promising way to integrate public data for high-utility DP-SDG. However, conditioning on many public columns scales poorly, which motivates future work on more efficient conditioning and generator-based approaches.

Abstract

Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.

Paper Structure

This paper contains 27 sections, 1 theorem, 1 equation, 3 figures, 1 algorithm.

Key Result

Lemma A.3

If an algorithm $\mathcal{M}$ satisfies $\rho$-zCDP then it satisfies $(\varepsilon,\delta)$-DP for all $\varepsilon > 0$ with
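The lemma statement is cut off after "with" in this excerpt, so the paper's exact expression is not shown here. As a hedged illustration only, the sketch below assumes the conversion commonly stated in the cited Canonne et al. (2020) work, $\delta = \inf_{\alpha > 1} \frac{\exp((\alpha-1)(\alpha\rho-\varepsilon))}{\alpha-1}\left(1-\frac{1}{\alpha}\right)^{\alpha}$, and evaluates it on a finite grid of $\alpha$. Because every $\alpha > 1$ yields a valid $\delta$, a grid search can only be slightly conservative. The function names `zcdp_to_delta` and `dp_to_zcdp` and the grid bounds are choices made for this example, not taken from the paper.

```python
import math

# Assumption: a log-spaced grid over (alpha - 1) in [1e-3, 1e4].
# Any alpha > 1 gives a valid delta, so a finite grid is conservative.
_ALPHAS = [1.0 + 10.0 ** (k / 50.0) for k in range(-150, 201)]


def zcdp_to_delta(rho, eps):
    """Delta such that rho-zCDP implies (eps, delta)-DP (assumed bound)."""
    if rho < 0 or eps <= 0:
        raise ValueError("need rho >= 0 and eps > 0")
    # Work in log space so large alpha values cannot overflow exp().
    best_log = min(
        (a - 1.0) * (a * rho - eps)
        - math.log(a - 1.0)
        + a * math.log1p(-1.0 / a)
        for a in _ALPHAS
    )
    # delta is a probability bound, so cap it at 1.
    return math.exp(min(best_log, 0.0))


def dp_to_zcdp(eps, delta, iters=60):
    """Largest rho (by bisection) whose implied delta stays within budget."""
    if not 0.0 < delta < 1.0:
        raise ValueError("need 0 < delta < 1")
    lo, hi = 0.0, 1.0
    # Grow the upper bracket until it violates the delta budget.
    while zcdp_to_delta(hi, eps) <= delta:
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if zcdp_to_delta(mid, eps) <= delta:
            lo = mid  # mid is still within budget
        else:
            hi = mid
    return lo
```

A typical use is to fix a target $(\varepsilon, \delta)$, call `dp_to_zcdp(eps, delta)` once, and then split the resulting $\rho$ across the individual noisy measurements, since zCDP budgets add up under composition.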

Figures (3)

  • Figure 1: Varying $\varepsilon$ on Adult (red.) with $d_{\text{pub}} = 6$.
  • Figure 2: Varying $\varepsilon$ on Adult with $25\%$, $50\%$, and $75\%$ of the columns being public.
  • Figure 3: Varying the percentage of public columns on Census data with $p \in \{10\%, 25\%, 50\%, 75\%, 90\%\}$ and $\varepsilon \in \{1, 5\}$.

Theorems & Definitions (6)

  • Definition 3.1: $(\varepsilon, \delta)$-DP
  • Definition A.1: $(\varepsilon, \delta)$-DP
  • Definition A.2: $\rho$-zCDP
  • Lemma A.3: zCDP to DP (Canonne et al., 2020)
  • Definition A.4: Gaussian Mechanism
  • Definition A.5: Exponential Mechanism
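The bodies of Definitions A.4 and A.5 are not reproduced in this listing. As a rough sketch only, the code below assumes the standard zCDP forms of the two mechanisms: Gaussian noise with scale $\sigma = \Delta_2 / \sqrt{2\rho}$ satisfies $\rho$-zCDP for a query with $L_2$ sensitivity $\Delta_2$, and the exponential mechanism run with parameter $\varepsilon = \sqrt{8\rho}$ satisfies $\rho$-zCDP. All function and parameter names are hypothetical and chosen for this example.

```python
import math
import random


def gaussian_sigma(l2_sensitivity, rho):
    """Noise scale for rho-zCDP (assumed form: rho = sensitivity^2 / (2 sigma^2))."""
    return l2_sensitivity / math.sqrt(2.0 * rho)


def gaussian_mechanism(values, l2_sensitivity, rho, rng):
    """Add independent Gaussian noise to each entry of a query answer."""
    sigma = gaussian_sigma(l2_sensitivity, rho)
    return [v + rng.gauss(0.0, sigma) for v in values]


def exponential_mechanism(scores, sensitivity, rho, rng):
    """Pick an index with probability proportional to exp(eps * score / (2 * sensitivity))."""
    eps = math.sqrt(8.0 * rho)  # assumed form: eps-DP exponential mechanism is (eps^2 / 8)-zCDP
    top = max(scores)
    # Subtract the maximum score before exponentiating for numerical stability.
    weights = [math.exp(eps * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


rng = random.Random(0)
noisy_counts = gaussian_mechanism([120.0, 45.0, 35.0], l2_sensitivity=1.0, rho=0.05, rng=rng)
chosen = exponential_mechanism([3.0, 10.0, 7.5], sensitivity=1.0, rho=0.05, rng=rng)
```

In select-measure-generate methods, these two primitives typically play complementary roles: the exponential mechanism selects which marginal to measure, and the Gaussian mechanism releases its noisy counts.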