Scalable and Privacy-Preserving Synthetic Data Generation on Decentralised Web
Vishal Ramesh, Rui Zhao, Naman Goel
TL;DR
The paper addresses scalable, privacy-preserving synthetic data generation in a decentralised Web setting. It introduces a hybrid architecture that combines Libertas' MPC framework with Intel SGX secure enclaves, moving the heavyweight differentially private (DP) synthesis step into the enclave to reduce the computation and communication burden while preserving contributor autonomy. Empirical results on simulated and real datasets show orders-of-magnitude improvements in performance and data transfer over MPC-only baselines while maintaining differential privacy guarantees. The work makes decentralised synthetic data workflows more practical, supporting trustworthy AI without centralised data custodians and strengthening trust among data contributors.
Abstract
Data on the Web has fueled much of the recent progress in AI. As more high-quality data becomes difficult to access, synthetic data is emerging as a promising solution for privacy-friendly data release and for complementing real datasets in developing robust and safe AI. Yet there is limited work on decentralised, scalable and contributor-centric synthetic data generation systems. A recent proposal, called Libertas, allows data contributors to autonomously participate in joint computations over their Web data without relying on a trusted centre. Libertas uses Solid (Social Linked Data) and MPC (Secure Multi-Party Computation) to achieve this goal. Solid is a decentralised Web specification that lets anyone store their data securely in personal decentralised data stores, called Pods, and control which applications have access to their data. MPC refers to a set of cryptographic methods that let different parties jointly compute a function over their inputs while keeping those inputs private. Thus, Libertas can also be used to generate synthetic data from otherwise inaccessible Web data in a responsible way, by ensuring contributor autonomy, decentralisation and privacy. However, the scalability of this system remains limited due to the high computation and communication costs of MPC. In this paper, we show how one can improve Libertas using secure enclaves (in addition to MPC) to address the scalability challenge. Secure enclaves such as Intel SGX rely on hardware-based features for confidentiality and integrity of code and data. We discuss a principled approach for integrating SGX within the Libertas architecture for scalable differentially private synthetic data generation, and support our analysis with rigorous empirical results on simulated and real datasets and different synthetic data generation algorithms.
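To make the MPC idea in the abstract concrete, the following is a minimal sketch of additive secret sharing, one common MPC building block, used here to compute a sum without revealing any single input. It is an illustration only and is not taken from Libertas: the modulus `PRIME` and the functions `share`, `reconstruct` and `secure_sum` are hypothetical names chosen for this example, and all parties are simulated in a single process.

```python
import secrets
from typing import List

# Hypothetical modulus for illustration; real MPC frameworks pick their own field or ring.
PRIME = 2**61 - 1


def share(value: int, n_parties: int) -> List[int]:
    """Split `value` into n additive shares that sum to `value` modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: List[int]) -> int:
    """Recombine additive shares into the underlying value."""
    return sum(shares) % PRIME


def secure_sum(inputs: List[int], n_parties: int) -> int:
    """Simulate a shared-sum computation among `n_parties` computing parties."""
    # Each contributor splits its input and sends one share to each computing party.
    all_shares = [share(x, n_parties) for x in inputs]
    # Each party locally adds the shares it received (one per contributor).
    partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]
    # Only the per-party partial sums are combined, never the raw inputs.
    return reconstruct(partial_sums)
```

Each individual share is uniformly random on its own, so a single computing party learns nothing about a contributor's input. In a real deployment every share exchange is a network message, and richer functions need many more rounds, which is the kind of communication cost the abstract attributes to MPC.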
