Scalable and Privacy-Preserving Synthetic Data Generation on Decentralised Web

Vishal Ramesh; Rui Zhao; Naman Goel

TL;DR

The paper addresses scalable, privacy-preserving synthetic data generation in a decentralised Web setting. It introduces a hybrid architecture that combines Libertas' MPC framework with Intel SGX secure enclaves: MPC is retained only for histogram aggregation and for nominating a random enclave agent, while the heavyweight differentially private (DP) synthesis runs inside the enclave. This reduces the computation and communication burden while preserving contributor autonomy. Empirical results on simulated and real datasets demonstrate orders-of-magnitude improvements in performance and data transfer over MPC-only baselines while maintaining differential privacy guarantees. The work advances practical decentralised synthetic data workflows, supporting trustworthy AI without centralised data custodians and enhancing trust among data contributors.

Abstract

Data on the Web has fueled much of the recent progress in AI. As more high-quality data becomes difficult to access, synthetic data is emerging as a promising solution for privacy-friendly data release and for complementing real datasets in developing robust and safe AI. But there is limited work on decentralised, scalable and contributor-centric synthetic data generation systems. A recent proposal, called Libertas, allows data contributors to autonomously participate in joint computations over their Web data without relying on a trusted centre. Libertas uses Solid (Social Linked Data) and MPC (Secure Multi-Party Computation) to achieve this goal. Solid is a decentralised Web specification that lets anyone store their data securely in personal decentralised data stores called Pods and control which applications have access to their data. MPC refers to the set of cryptographic methods that allow different parties to jointly compute a function over their inputs while keeping those inputs private. Thus, Libertas can also be used to generate synthetic data from otherwise inaccessible Web data in a responsible way, by ensuring contributor autonomy, decentralisation and privacy. However, the scalability of this system remains limited due to the high computation and communication costs of MPC. In this paper, we show how one can improve Libertas using secure enclaves (in addition to MPC) to address the scalability challenge. Secure enclaves such as Intel SGX rely on hardware-based features for confidentiality and integrity of code and data. We discuss a principled approach for integrating SGX within the Libertas architecture for scalable differentially private synthetic data generation, and support our analysis with rigorous empirical results on simulated and real datasets and different synthetic data generation algorithms.
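To make the division of labour concrete, the following is a minimal, single-process Python sketch of the two stages described above. It is not the paper's implementation: additive secret sharing stands in for the MPC protocols, a plain function stands in for the enclave, and the Laplace mechanism stands in for a full synthesis algorithm. The names `share`, `aggregate_histograms`, `noisy_histogram` and the modulus `PRIME` are hypothetical, and the noise scale assumes add/remove-one-record neighbouring datasets (L1 sensitivity 1).

```python
import random

# Illustrative field modulus for additive sharing (an assumption, not from the paper).
PRIME = 2**61 - 1


def share(value, n_parties, rng):
    """Split an integer into n_parties additive shares modulo PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the value.
    shares.append((value - sum(shares)) % PRIME)
    return shares


def aggregate_histograms(local_histograms, n_parties, rng):
    """Stage 1 (stand-in for MPC): sum per-contributor histograms via shares.

    Each contributor splits every bin count into shares, one per computing
    party. Each party only ever sees uniformly random-looking shares and
    keeps a running total per bin; combining the parties' totals reveals
    the aggregate histogram and nothing about individual inputs.
    """
    n_bins = len(local_histograms[0])
    party_totals = [[0] * n_bins for _ in range(n_parties)]
    for hist in local_histograms:
        for b, count in enumerate(hist):
            for p, s in enumerate(share(count, n_parties, rng)):
                party_totals[p][b] = (party_totals[p][b] + s) % PRIME
    return [
        sum(party_totals[p][b] for p in range(n_parties)) % PRIME
        for b in range(n_bins)
    ]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(hist, epsilon, rng):
    """Stage 2 (stand-in for the enclave): add Laplace noise to each bin.

    Assumes L1 sensitivity 1, i.e. neighbouring datasets differ by adding
    or removing one record.
    """
    return [c + laplace_noise(1.0 / epsilon, rng) for c in hist]


rng = random.Random(0)
local = [[1, 0, 2], [0, 3, 1], [4, 0, 0]]      # three contributors, three bins
aggregate = aggregate_histograms(local, n_parties=3, rng=rng)
released = noisy_histogram(aggregate, epsilon=2.0, rng=rng)
```

In this sketch only the aggregation touches secret-shared values; the noise addition is ordinary floating-point code, which illustrates why moving such steps out of MPC can reduce cost. A real deployment would additionally require remote attestation of the enclave and a cryptographically secure noise source.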

Paper Structure (29 sections, 10 figures, 2 tables)

Introduction
Problem Description
Proposed Approach for Scalable Decentralised Synthetic Data Generation
Steps
Separating Noise Addition and Generation from MPC
Random Selection
Secure Enclave
Discussion
Assumptions
Empirical Evaluation
Implementation Details
MP-SPDZ
Gramine
Marginal-Based Inference (MBI)
Experimental Setting
...and 14 more sections

Figures (10)

Figure 1: For generating differentially-private synthetic data from personal data stored in Solid pods, we adapt the Libertas architecture such that MPC is used for histogram aggregation and nominating a random enclave agent only. Subsequent steps of synthetic data generation are executed in the enclave (Intel SGX) after remote attestation.
Figure 2: Comparison of MPC Only and MPC+SGX Approaches [MWEM; MASCOT protocol, fixed total data (simulated) i.e. 10000 data points divided equally among data providers; 10 bins, $\epsilon = 2$, $T=30$.]
Figure 3: Performance of the MPC+SGX approach using PGM and Local Consistency generation algorithms (30 iterations) [Adult dataset; SHAMIR protocol, one data point per provider.]
Figure 4: Comparison of MPC Only and MPC+SGX Approaches [MWEM; MASCOT protocol, variable total data (simulated) i.e. 100 data points per provider; 10 bins, $\epsilon = 2$, $T=30$.]
Figure 5: Comparison of time in MPC Only and MPC+SGX Approaches, for different number of iterations of MWEM. MASCOT protocol, 100 data providers, 10 bins, $\epsilon = 2$, $T=100$.
...and 5 more figures
