CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources
Sikha Pentyala, Mayana Pereira, Martine De Cock
TL;DR
CaPS addresses the challenge of generating differentially private synthetic tabular data from distributed data holders without a trusted aggregator. It combines differential privacy with secure multi-party computation (DP-in-MPC): the select and measure steps run inside MPC, while the generate step operates on the DP-perturbed outputs and is covered by DP's post-processing guarantee. The goal is to match the utility of centralized DP. The framework is modular and compatible with state-of-the-art marginal-based SDG algorithms such as AIM and MWEM+PGM, and it supports horizontal, vertical, or mixed data partitioning. Experimental results on real datasets show that CaPS can achieve utility close to that of centralized DP while providing input privacy, at the cost of MPC overhead, demonstrating the practical viability of privacy-preserving collaborative SDG as a service. The work lays a foundation for extending DP-in-MPC to a broader class of SDG algorithms and deployment scenarios without specialized hardware.
Abstract
Data is the lifeblood of the modern world, forming a fundamental part of AI, decision-making, and research advances. With growing interest in data, governments have taken important steps towards a regulated data world, drastically impacting data sharing and data usability and leaving massive amounts of data confined within the walls of organizations. While synthetic data generation (SDG) is an appealing solution to break down these walls and enable data sharing, the main drawback of existing solutions is the assumption of a trusted aggregator for generative model training. Given that many data holders may not want to, or may not be legally allowed to, entrust a central entity with their raw data, we propose a framework for the collaborative and private generation of synthetic tabular data from distributed data holders. Our solution is general, applicable to any marginal-based SDG algorithm, and provides input privacy by replacing the trusted aggregator with secure multi-party computation (MPC) protocols and output privacy via differential privacy (DP). We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate SDG algorithms MWEM+PGM and AIM.
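To make the idea of replacing the trusted aggregator concrete, the following is a minimal toy sketch of a measure step for horizontally partitioned data. It is not the CaPS protocol: it assumes plain additive secret sharing over a hypothetical modulus, and it samples the noise in the clear before sharing it, whereas a real DP-in-MPC protocol would sample the noise jointly inside MPC so that no single party learns it. The names `share`, `reconstruct`, `discrete_laplace`, and `noisy_marginal` are illustrative, the sampler is not a vetted DP mechanism, and the noise scale is not calibrated to any privacy budget.

```python
import math
import random

MODULUS = 2**61 - 1  # hypothetical share modulus, chosen only for illustration


def share(value, n_parties, rng):
    """Split an integer into n additive shares modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares and map the result back to a signed integer."""
    total = sum(shares) % MODULUS
    return total - MODULUS if total > MODULUS // 2 else total


def discrete_laplace(scale, rng):
    """Integer noise as the difference of two geometric samples (toy sampler)."""
    def geometric():
        # 1 - rng.random() lies in (0, 1], so the logarithm is defined.
        return math.floor(-scale * math.log(1.0 - rng.random()))
    return geometric() - geometric()


def noisy_marginal(local_histograms, n_servers, scale, rng):
    """Sum per-holder histograms under secret sharing and open only the noisy sum."""
    n_cells = len(local_histograms[0])
    server_shares = [[0] * n_cells for _ in range(n_servers)]

    def add_shared(cell, value):
        # Each server adds its share locally; no server sees `value` itself.
        for s, sh in enumerate(share(value, n_servers, rng)):
            server_shares[s][cell] = (server_shares[s][cell] + sh) % MODULUS

    # Each data holder secret-shares its local counts with the servers.
    for hist in local_histograms:
        for cell, count in enumerate(hist):
            add_shared(cell, count)

    # ASSUMPTION: noise is sampled in the clear here for simplicity; in a real
    # DP-in-MPC protocol it would be generated jointly inside MPC.
    for cell in range(n_cells):
        add_shared(cell, discrete_laplace(scale, rng))

    # Only the perturbed marginal is reconstructed.
    return [
        reconstruct([server_shares[s][cell] for s in range(n_servers)])
        for cell in range(n_cells)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    holders = [[3, 0, 5], [1, 2, 0]]  # two holders, one 3-cell marginal each
    print(noisy_marginal(holders, n_servers=3, scale=2.0, rng=rng))
```

In this sketch the exact marginal is never opened: the servers only ever combine shares, and the single reconstructed value is the perturbed count. A downstream generate step could then consume the output as it would in the centralized setting.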
