GEM+: Scalable State-of-the-Art Private Synthetic Data with Generator Networks
Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock
TL;DR
Our work tackles DP-SDG for high-dimensional tabular data, where traditional select-measure-generate methods using graphical models scale poorly. We propose GEM+, which marries AIM's adaptive marginal selection and budget allocation with GEM's scalable generator networks, incorporating enhancements like workload closure, marginal closure, and filtered candidate selection. We demonstrate on the Criteo Ads dataset with up to 120 columns that GEM+ achieves state-of-the-art utility under ($\epsilon$, $\delta$)-DP (equivalently $\rho$-zCDP) while maintaining tractable runtimes, significantly outperforming AIM and GEM in high-dimensional settings. This work enables practical private synthetic data at web scale and highlights the importance of adaptive, scalable DP-SDG design.
Abstract
State-of-the-art differentially private synthetic tabular data has been defined by adaptive 'select-measure-generate' frameworks, exemplified by methods like AIM. These approaches iteratively measure low-order noisy marginals and fit graphical models to produce synthetic data, enabling systematic optimisation of data quality under privacy constraints. Graphical models, however, are inefficient for high-dimensional data because they require substantial memory and must be retrained from scratch whenever the graph structure changes, leading to significant computational overhead. Recent methods, like GEM, overcome these limitations by using generator neural networks for improved scalability. However, empirical comparisons have mostly focused on small datasets, limiting real-world applicability. In this work, we introduce GEM+, which integrates AIM's adaptive measurement framework with GEM's scalable generator network. Our experiments show that GEM+ outperforms AIM in both utility and scalability, delivering state-of-the-art results while efficiently handling datasets with over a hundred columns, where AIM fails due to memory and computational overheads.
