PRISM: Differentially Private Synthetic Data with Structure-Aware Budget Allocation for Prediction

Amir Asiaee; Chao Yan; Zachary B. Abrams; Bradley A. Malin

TL;DR

PRISM introduces a prediction-centric approach to differentially private synthetic data by allocating the privacy budget toward predictive structure in three regimes: causal parents for robustness under shift, Markov blanket for graphical sufficiency in fixed distributions, and data-driven predictive feature selection when no structure is known. It formalizes this as an end-to-end pipeline that maps private data $D$ to a synthetic $\tilde{D}$ which downstream learners use to produce predictions, with a distributional surrogate based on the total variation distance to bound excess risk. The mechanism uses task-aware workload construction and a closed-form budget allocation to prioritize task-critical statistics, and it leverages Private-PGM for graphical-model-based synthesis. Empirically, PRISM shows improved predictive accuracy over generic DP synthesizers, especially under distribution shift, and demonstrates competitive performance on the Adult dataset when targeting a single predictive task. This work enables practical, privacy-preserving data sharing tailored to downstream predictive goals, with principled guarantees and clear regime-based guidelines for feature targeting and budget allocation.

Abstract

Differential privacy (DP) provides a mathematical guarantee limiting what an adversary can learn about any individual from released data. However, achieving this protection typically requires adding noise, and noise can accumulate when many statistics are measured. Existing DP synthetic data methods treat all features symmetrically, spreading noise uniformly even when the data will serve a specific prediction task. We develop a prediction-centric approach operating in three regimes depending on available structural knowledge. In the causal regime, when the causal parents of $Y$ are known and distribution shift is expected, we target the parents for robustness. In the graphical regime, when a Bayesian network structure is available and the distribution is stable, the Markov blanket of $Y$ provides a sufficient feature set for optimal prediction. In the predictive regime, when no structural knowledge exists, we select features via differentially private methods without claiming to recover causal or graphical structure. We formalize this as PRISM, a mechanism that (i) identifies a predictive feature subset according to the appropriate regime, (ii) constructs targeted summary statistics, (iii) allocates budget to minimize an upper bound on prediction error, and (iv) synthesizes data via graphical-model inference. We prove end-to-end privacy guarantees and risk bounds. Empirically, task-aware allocation improves prediction accuracy compared to generic synthesizers. Under distribution shift, targeting causal parents achieves AUC $\approx 0.73$ while correlation-based selection collapses to chance ($\approx 0.49$).
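Step (iii) can be illustrated with a toy calculation. The sketch below is not PRISM's allocation rule, which this summary does not state. It assumes a hypothetical surrogate bound of the form $\sum_i w_i/\varepsilon_i$, where the error of each Laplace-noised statistic scales as $1/\varepsilon_i$ and $w_i$ is an assumed importance weight, minimized subject to $\sum_i \varepsilon_i = \varepsilon$. Under that assumption the Lagrangian solution is $\varepsilon_i \propto \sqrt{w_i}$. The names `allocate_budget` and `surrogate_bound` and the example weights are illustrative only.

```python
import math


def allocate_budget(weights, total_epsilon):
    """Split total_epsilon so that epsilon_i is proportional to sqrt(w_i).

    This minimizes sum_i w_i / epsilon_i subject to sum_i epsilon_i = total_epsilon.
    The objective is an assumed surrogate, not the bound used in the paper.
    """
    if total_epsilon <= 0:
        raise ValueError("total_epsilon must be positive")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    roots = [math.sqrt(w) for w in weights]
    norm = sum(roots)
    return [total_epsilon * r / norm for r in roots]


def surrogate_bound(weights, epsilons):
    """Assumed error surrogate: importance-weighted sum of 1/epsilon_i terms."""
    return sum(w / e for w, e in zip(weights, epsilons))


# One task-critical statistic (weight 9) and three minor ones (weight 1 each).
weights = [9.0, 1.0, 1.0, 1.0]
total_epsilon = 1.0

optimized = allocate_budget(weights, total_epsilon)
uniform = [total_epsilon / len(weights)] * len(weights)

print("optimized split:", optimized)
print("bound, optimized:", surrogate_bound(weights, optimized))
print("bound, uniform:  ", surrogate_bound(weights, uniform))
```

In this toy setting the heavily weighted statistic receives half of the budget, and the surrogate bound is lower than under a uniform split. The gap widens as the weights become more heterogeneous.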

Paper Structure (105 sections, 8 theorems, 28 equations, 5 figures, 5 tables, 1 algorithm)

Introduction
Why budget matters.
A prediction-centric alternative.
Identifying the right features.
An end-to-end perspective.
What distinguishes our approach.
Contributions.
Scope.
Related Work
Preliminaries
Data model and notation
Differential privacy
Causal graphs and Markov blanket
Problem Formulation
A pipeline objective: DP synthesis for downstream learning
...and 90 more sections

Key Result

Theorem 6.1

Assume Step 1 (subset selection) is $(\varepsilon_{\mathrm{sel}},0)$-DP, Step 2 (MI estimation for 3-way marginal selection) is $(\varepsilon_{\mathrm{mi}},0)$-DP, and Step 4 measures each query $q\in\mathcal{W}$ with a mechanism that is $(\varepsilon_q,\delta_q)$-DP. Then the overall mechanism in A
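The theorem statement is truncated here, so its exact conclusion is not reproduced. As a rough guide to how per-step guarantees of this kind combine, the sketch below applies basic sequential composition, a standard DP result under which the $\varepsilon$ and $\delta$ values of mechanisms run on the same data add up. This is an assumption about the accounting, not the theorem's stated bound; the paper may use a tighter analysis. The `Guarantee` class, the `compose_basic` function, and the numeric budgets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guarantee:
    """An (epsilon, delta)-DP guarantee for one step of a pipeline."""
    epsilon: float
    delta: float = 0.0


def compose_basic(guarantees):
    """Basic sequential composition: epsilons and deltas add across steps."""
    epsilon = sum(g.epsilon for g in guarantees)
    delta = sum(g.delta for g in guarantees)
    return Guarantee(epsilon, delta)


# Illustrative budgets only: selection, MI estimation, then two measured queries.
steps = [
    Guarantee(epsilon=0.1),              # Step 1: subset selection, pure DP
    Guarantee(epsilon=0.1),              # Step 2: MI estimation, pure DP
    Guarantee(epsilon=0.4, delta=1e-6),  # Step 4: query q1
    Guarantee(epsilon=0.4, delta=1e-6),  # Step 4: query q2
]

total = compose_basic(steps)
print(f"overall: ({total.epsilon:.2f}, {total.delta:.1e})-DP under basic composition")
```

Because the selection and MI steps are pure DP ($\delta = 0$), only the query measurements contribute to the overall $\delta$ in this accounting.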

Figures (5)

Figure 1: SCM spurious shift: TSTR ROC-AUC vs. $\varepsilon$. PRISM-Causal (Regime 1) remains robust; correlation-based methods collapse because they rely on spurious children. Error bars: 95% CI; MST/PrivBayes at $\varepsilon=1.0$ only.
Figure 2: SCM marginal shift: TSTR ROC-AUC vs. $\varepsilon$. Child features remain predictive, so blanket preservation helps. Error bars: 95% CI; MST/PrivBayes at $\varepsilon=1.0$ only.
Figure 3: Adult income: TSTR ROC-AUC vs. $\varepsilon$. Workload-based variants are similar on this large dataset; MST/PrivBayes lag. Error bars: 95% CI; MST/PrivBayes at $\varepsilon=1.0$ only.
Figure 4: Adult Income: 1-way marginal L1 error vs. $\varepsilon$ (mean with 95% CI).
Figure 5: Allocation wins: closed-form allocation vs. uniform on a heterogeneous-importance Naive Bayes dataset.

Theorems & Definitions (26)

Definition 3.1: Differential Privacy (Dwork et al., 2006; Dwork and Roth, 2014)
Definition 4.1: Prediction-centric DP synthesis objective
Definition 4.2: Task marginal distance
Theorem 6.1: DP of PRISM
Remark 6.1
Proposition 6.1: Markov blanket sufficiency for prediction
Remark 6.2: Graphical vs. causal sufficiency
Lemma 6.1: Risk depends only on $(X_S,Y)$ marginal
Lemma 6.2: Total variation controls risk difference
Theorem 6.2: Excess risk of ERM on synthetic data
...and 16 more
