CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources

Sikha Pentyala; Mayana Pereira; Martine De Cock

TL;DR

CaPS addresses the challenge of generating differentially private synthetic tabular data from distributed data holders without a trusted aggregator. It integrates differential privacy with secure multi-party computation (DP-in-MPC) to perform the select and measure steps inside MPC, while the generate step operates on DP-perturbed outputs and benefits from DP's post-processing guarantees, aiming to match centralized DP utility. The framework is modular and compatible with state-of-the-art marginal-based SDG algorithms such as AIM and MWEM+PGM, handling horizontal, vertical, or mixed data distributions. Experimental results on real datasets show CaPS can achieve utility close to centralized DP while providing input privacy, at the cost of MPC overhead, demonstrating the practical viability of privacy-preserving collaborative SDG as a service. The work lays a foundation for extending DP-in-MPC to a broader class of SDG algorithms and deployment scenarios without specialized hardware.
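To make the "measure" idea more concrete, the following is a minimal plaintext simulation, not the paper's actual MPC protocol. It assumes horizontally partitioned data, a one-way marginal over a single attribute, additive secret sharing over a prime field, and Gaussian noise. The names `share`, `reconstruct`, `local_marginal`, `noisy_marginal`, and the modulus `PRIME` are hypothetical and chosen for illustration only. In a real DP-in-MPC deployment the noise would be sampled inside the MPC protocol so that no party ever sees the exact counts; here the reconstruction happens in the clear purely to keep the sketch short.

```python
import random

# Illustrative modulus for additive secret sharing (an assumption, not from the paper).
PRIME = 2**61 - 1


def share(value, n_servers, rng):
    """Split an integer into n_servers additive shares modulo PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def local_marginal(rows, domain):
    """Count how often each domain value occurs in one data holder's rows."""
    counts = {v: 0 for v in domain}
    for value in rows:
        counts[value] += 1
    return [counts[v] for v in domain]


def noisy_marginal(holders, domain, sigma, n_servers=3, seed=0):
    """Aggregate secret-shared local counts, then add Gaussian noise.

    holders: list of per-holder row lists (horizontal partitioning).
    sigma:   standard deviation of the Gaussian noise.
    """
    rng = random.Random(seed)
    # Each simulated server keeps a running sum of the shares it receives.
    server_sums = [[0] * len(domain) for _ in range(n_servers)]
    for rows in holders:
        for cell, count in enumerate(local_marginal(rows, domain)):
            for s, sh in enumerate(share(count, n_servers, rng)):
                server_sums[s][cell] = (server_sums[s][cell] + sh) % PRIME
    # Simulation shortcut: reconstruct in the clear, then perturb.
    # A real protocol would add the noise on secret shares before opening.
    exact = [
        reconstruct([server_sums[s][c] for s in range(n_servers)])
        for c in range(len(domain))
    ]
    return [x + rng.gauss(0, sigma) for x in exact]
```

For example, `noisy_marginal([["a", "b"], ["a"]], ["a", "b"], sigma=1.0)` returns a perturbed version of the combined counts for two data holders. Only this perturbed vector would be handed to the generate step, which is why that step can run outside MPC under DP's post-processing property.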

Abstract

Data is the lifeblood of the modern world, forming a fundamental part of AI, decision-making, and research advances. With growing interest in data, governments have taken important steps towards a regulated data world, drastically impacting data sharing and data usability and resulting in massive amounts of data confined within the walls of organizations. While synthetic data generation (SDG) is an appealing solution to break down these walls and enable data sharing, the main drawback of existing solutions is the assumption of a trusted aggregator for generative model training. Given that many data holders may not want to, or be legally allowed to, entrust a central entity with their raw data, we propose a framework for the collaborative and private generation of synthetic tabular data from distributed data holders. Our solution is general, applicable to any marginal-based SDG, and provides input privacy by replacing the trusted aggregator with secure multi-party computation (MPC) protocols and output privacy via differential privacy (DP). We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate SDG algorithms MWEM+PGM and AIM.

Paper Structure (21 sections, 2 equations, 2 figures, 5 tables, 11 algorithms)

Introduction
Preliminaries
CaPS: Collaborative and Private SDG
Setup Phase
Computation of Answers on Distributed Data
Selection of the Query
Measuring Answer to the Selected Query
Generation of Synthetic Data
Note on Modularity
Experimental Evaluation
Related Work
Conclusion
MPC protocols.
Extending Protocol 2 to compute $p$-way marginals.
Discussion on common datasets.
...and 6 more sections

Figures (2)

Figure 1: $\texttt{CaPS}$: A framework that leverages 'DP-in-MPC' to collaboratively and privately generate tabular synthetic data using marginal-based SDG techniques with the 'select-measure-generate' template. Servers run MPC protocols for 'select' and 'measure'. The 'generate' step is performed over differentially private measurements.
Figure 2: Scalability of $\pi_{\mathsf{COMP}}$ in a 3PC passive setting. MPC protocols are run with $M=3$, $d=10$, $|Q|=36$, $\max(\omega_q)=25$. Left: scalability of $\pi_{\mathsf{COMP}}$ for different total dataset sizes $n$. Right: scalability of $\pi_{\mathsf{COMP}}$ for different numbers of data holders $N$.
