Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation
Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock
TL;DR
This work tackles differentially private synthetic data generation (DP-SDG) under a vertical public-private split, where public and private features coexist within the same dataset. It proposes a vertical framework that adapts horizontal public-assisted DP-SDG methods (GEM-Pub, PMW) as vGEM^Pub and vPMW, and adapts JAM-PGM to produce vJAM-PGM. It also introduces a conditional-generation alternative that uses the public columns to sample private features via Private-PGM. In experiments on the Adult and Census datasets, the vertical pretraining methods generally underperform fully private AIM, whereas conditional generation, especially Conditional AIM, delivers the best utility, sometimes achieving zero error on public marginals while lowering private marginal error. The findings suggest that, in practical vertical deployments, integrating public data through conditional generation is the most promising path to high-utility DP-SDG. Conditioning on many public columns scales poorly, however, which motivates future work on more efficient conditioning and generator-based approaches.
Abstract
Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that preserves the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to enhance the quality of synthetic data. These methods study a horizontal public-private partitioning, which assumes access to a small number of public rows that can be used for model initialization and provides a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods to the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data-assisted methods and proposing future research directions to address these challenges.
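To make the conditional-generation idea concrete, the following is a minimal sketch under simplifying assumptions, not the paper's Private-PGM or Conditional AIM implementation. It assumes one categorical public column and one categorical private column, measures their joint histogram with Gaussian noise, and samples the private value conditioned on each public value. The function name `conditional_synth`, the toy data, and the noise scale `sigma` are all hypothetical; in practice `sigma` would have to be calibrated to the target privacy budget.

```python
import numpy as np

def conditional_synth(pub, priv, n_pub, n_priv, sigma, rng):
    """Copy the public column and sample the private column from a noisy conditional."""
    # Joint contingency table over (public value, private value).
    joint = np.zeros((n_pub, n_priv))
    np.add.at(joint, (pub, priv), 1.0)
    # Gaussian mechanism: adding or removing one record changes one cell by 1,
    # so the L2 sensitivity of the table is 1. sigma is an illustrative value only.
    noisy = joint + rng.normal(0.0, sigma, size=joint.shape)
    # Post-processing: clip negatives, then normalise each row into P(priv | pub).
    noisy = np.clip(noisy, 0.0, None) + 1e-9
    cond = noisy / noisy.sum(axis=1, keepdims=True)
    # The public column is reproduced unchanged; only the private column is sampled.
    synth_priv = np.array([rng.choice(n_priv, p=cond[v]) for v in pub])
    return pub.copy(), synth_priv

# Toy vertically split data (hypothetical): 3 public categories, 2 private categories.
rng = np.random.default_rng(0)
pub = rng.integers(0, 3, size=1000)
priv = (rng.random(1000) < 0.2 + 0.3 * pub).astype(int)
synth_pub, synth_priv = conditional_synth(pub, priv, 3, 2, sigma=5.0, rng=rng)
```

Because the public column is copied rather than modeled, its marginals are reproduced exactly in this sketch, which is consistent with the zero public-marginal error reported for conditional generation. The sketch also hints at the scalability concern: the joint table grows with the product of the public columns' domain sizes.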
