Bounding the Excess Risk for Linear Models Trained on Marginal-Preserving, Differentially-Private, Synthetic Data

Yvonne Zhou, Mingyu Liang, Ivan Brugere, Dana Dachman-Soled, Danial Dervovic, Antigoni Polychroniadou, Min Wu

TL;DR

The paper addresses the privacy risk of ML models by applying differential privacy at the preprocessing stage: models are trained on marginal-preserving, differentially private synthetic data rather than on the real data. It provides an end-to-end analysis, deriving upper and lower bounds on the excess empirical risk of linear models trained on synthetic data that approximately preserves low-order marginals, with tighter guarantees for logistic regression. A DP mechanism $\text{Gen}_{d,\sigma}$ is proposed to generate such synthetic data, linking the privacy parameters to utility through the marginal distance $\nu$, and a matching lower bound demonstrates near-optimality in certain regimes. Empirically, training on AIM-generated, marginal-preserving DP synthetic data for six public datasets yields a small utility loss (often below 1-2%) and small excess risk, and the synthetic data can be reused for training without spending additional privacy budget. The work advances practical DP ML by showing that preserving marginals can sustain performance while providing scalable, reusable, private data for downstream tasks.

Abstract

The growing use of machine learning (ML) has raised concerns that an ML model may reveal private information about an individual who has contributed to the training dataset. To prevent leakage of sensitive data, we consider using differentially-private (DP), synthetic training data instead of real training data to train an ML model. A key desirable property of synthetic data is its ability to preserve the low-order marginals of the original distribution. Our main contribution comprises novel upper and lower bounds on the excess empirical risk of linear models trained on such synthetic data, for continuous and Lipschitz loss functions. We perform extensive experimentation alongside our theoretical results.
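
To make "preserving the low-order marginals" concrete, here is a minimal Python sketch that compares the $k$-way marginals of a real and a synthetic dataset. The helper names (`marginal`, `l1_distance`, `max_marginal_gap`), the toy data, and the choice of the largest $L_1$ gap over all column subsets as the distance are illustrative assumptions; the paper's exact definition of the marginal distance is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def marginal(rows, cols):
    """Empirical joint distribution of the columns `cols` (a k-way marginal)."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return {key: cnt / n for key, cnt in counts.items()}


def l1_distance(p, q):
    """L1 distance between two distributions stored as dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_marginal_gap(real, synth, order):
    """Largest L1 gap over all `order`-way marginals (an assumed summary statistic)."""
    n_cols = len(real[0])
    return max(
        l1_distance(marginal(real, cols), marginal(synth, cols))
        for cols in combinations(range(n_cols), order)
    )


# Hypothetical toy datasets with three binary features.
real = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
synth = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

gap_1way = max_marginal_gap(real, synth, order=1)
gap_2way = max_marginal_gap(real, synth, order=2)
print(gap_1way, gap_2way)
```

In this toy example the synthetic data matches every 1-way marginal but not the 2-way marginal of the first and third features, which illustrates why the order of the preserved marginals matters.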

Paper Structure

This paper contains 38 sections, 18 theorems, 27 equations, 5 figures, 2 tables, and 6 algorithms.

Key Result

Theorem 2.3

Suppose $a\leq 0<1\leq b$, and $f$ is a continuous function on $[a, b]$, for $d=1, 2, ...$, where $\omega$ is the modulus of continuity of $f$ on $[a, b]$. Additionally, let $P_df(x)=\sum_{k=0}^d a_{dk}x^k$, then for $d=1, 2, ...,$

Figures (5)

  • Figure 1: (a) shows the minimax approximation of the $\log(\text{sigmoid(x)})$ function on the interval $[-5, 5]$ by a degree-4 polynomial: $\log(\text{sigmoid(x)})_{minimax}\approx 0.71-0.5x+ 0.1096 x^2-0.0015 x^4$, with an error of 0.061. (b) shows the iterated Bernstein approximations of the $\log(\text{sigmoid(x)})$ function on the interval $[-5, 5]$ by degree-4 polynomials, obtained by iterating the Bernstein approximation 1, 4, and 9 times: $\log(\text{sigmoid(x)})_{Bern_1}\approx 1.2377- 0.5x+0.0544x^2-0.0001x^4$, with an error of 0.545; $\log(\text{sigmoid(x)})_{Bern_4}\approx 0.7934- 0.5x+0.0812x^2-0.0005x^4$, with an error of 0.100; $\log(\text{sigmoid(x)})_{Bern_9}\approx 0.7504- 0.5x+0.0931x^2-0.0009x^4$, with an error of 0.057.
  • Figure 2: We compare the $L_1$ error of the synthetic data generated by the AIM mechanism on all six datasets under different privacy budgets.
  • Figure 3: We generate synthetic data for the six datasets with $\epsilon \in \{\frac{1}{4}, \frac{2}{4}, \frac{3}{4}, 1, \frac{5}{4}, \frac{6}{4}, \frac{7}{4}, 2\}$, producing 10 randomized sets of synthetic data for each $\epsilon$. We assess performance by training the machine learning model 10 times on random splits of each dataset into 80% training and 20% testing data. Some minor variability is inevitable given the limited number of trials, and this causes the slight oscillation in the curves.
  • Figure 4: We train on the six datasets with the DP-SGD approach described in the paper's DP-SGD algorithm, using a gradient norm clipping threshold of 1 and a differential privacy budget of $\epsilon = 1$. Specifically, we select the learning rate from {1, 5}, the number of steps T from {300, 500, 1000}, the decay rate from {0.1, 0.5}, and the batch size from {20, 100, 200, 500, 1000, 3000}. Additionally, we train with another DP method, PATE learning, based on Papernot et al. (2017). For each dataset, we consider three different numbers of teachers chosen from {10, 15, 20, 100, 150, 200, 300, 450, 800}. The figure compares the accuracy of the different methods: Non-DP, AIM (DP synthetic data), DP-SGD, and PATE learning (with three teacher counts).
  • Figure 5: We train three classifier models on each dataset and on its synthetic data generated by AIM with privacy budget $\epsilon = 1$. For the datasets {Dutch, Adult, Law}, the three models are trained to classify three different target features: Dutch: {'occupation', 'prev_residence_place', 'sex'}, Adult: {'income>50K', 'sex', 'relationship'}, Law: {'pass_bar', 'race', 'fulltime'}. real_1 and aim_1 show the results for classifying the first feature when training on real data and on AIM synthetic data, respectively; real_2 and aim_2 show the same for the second feature; real_3 and aim_3 show the same for the third feature.
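
The polynomial approximations in Figure 1 can be explored numerically. The following Python sketch builds the plain (non-iterated) Bernstein polynomial of a function on $[a, b]$ and measures its sup-norm error on a grid. It assumes the target is the logistic loss $-\log(\text{sigmoid}(x)) = \log(1+e^{-x})$, which is what the positive constant terms in the caption suggest; the function names and the grid size are hypothetical, and the paper's iterated Bernstein construction is not reproduced.

```python
import math

import numpy as np


def logistic_loss(x):
    """Numerically stable log(1 + exp(-x)), i.e. -log(sigmoid(x))."""
    return np.logaddexp(0.0, -x)


def bernstein_approx(f, d, a, b):
    """Return the degree-d Bernstein polynomial of f on [a, b] as a callable."""
    k = np.arange(d + 1)
    nodes = a + (b - a) * k / d          # equally spaced sample points
    values = f(nodes)                    # f sampled at the nodes
    binom = np.array([math.comb(d, j) for j in k], dtype=float)

    def approx(x):
        # Map x from [a, b] to t in [0, 1], then evaluate the Bernstein basis.
        t = (np.asarray(x, dtype=float) - a) / (b - a)
        basis = binom * t[..., None] ** k * (1.0 - t[..., None]) ** (d - k)
        return basis @ values

    return approx


def sup_error(f, g, a, b, n_grid=2001):
    """Maximum absolute difference between f and g on a uniform grid."""
    xs = np.linspace(a, b, n_grid)
    return float(np.max(np.abs(f(xs) - g(xs))))


errors = {
    d: sup_error(logistic_loss, bernstein_approx(logistic_loss, d, -5.0, 5.0), -5.0, 5.0)
    for d in (4, 16, 64)
}
print(errors)
```

Under these assumptions the degree-4 error should land near the 0.545 reported for a single Bernstein step in Figure 1(b), and the error shrinks only slowly as the degree grows, which is one motivation for the iterated and minimax variants shown in the figure.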

Theorems & Definitions (32)

  • Definition 2.1: Marginal of Dataset
  • Definition 2.2: Bernstein Polynomial Approximation
  • Theorem 2.3: ROULIER1970117_bernstein, Th. 1 and bernstein_r_0, Th. 1.6.1
  • Theorem 2.4: bernstein_rth_order
  • Definition 2.5: $(\epsilon, \delta)$-Differential Privacy
  • Theorem 2.6: Gaussian Mechanism
  • Theorem 2.7: Post-Processing
  • Theorem 3.1
  • Theorem 3.2
  • Lemma 3.3
  • ...and 22 more
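
As an illustration of how the Gaussian mechanism (Theorem 2.6) can be applied to a marginal, the Python sketch below adds Gaussian noise to a contingency table of counts. It uses the classical calibration $\sigma = \Delta_2 \sqrt{2\ln(1.25/\delta)}/\epsilon$, valid for $0 < \epsilon < 1$, and assumes add/remove-one-record neighbouring datasets, so that a single count table has $L_2$ sensitivity 1. The function names and toy data are hypothetical; this is not the paper's $\text{Gen}_{d,\sigma}$ mechanism, and the paper may state the theorem in a different form.

```python
import math

import numpy as np


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Noise scale of the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("classical calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def noisy_marginal(data, cols, domain_sizes, epsilon, delta, rng):
    """Release a noisy count table for the integer-coded columns `cols`."""
    shape = tuple(domain_sizes[c] for c in cols)
    table = np.zeros(shape)
    # Count how often each combination of values occurs.
    np.add.at(table, tuple(data[:, c] for c in cols), 1.0)
    # Assumption: adding or removing one record changes one cell by 1.
    sigma = gaussian_sigma(epsilon, delta, l2_sensitivity=1.0)
    return table + rng.normal(0.0, sigma, size=shape), sigma


# Hypothetical toy dataset: three records, two binary features.
data = np.array([[0, 1], [0, 1], [1, 0]])
rng = np.random.default_rng(0)
noisy, sigma = noisy_marginal(data, (0, 1), [2, 2], epsilon=0.5, delta=1e-5, rng=rng)
print(sigma)
print(noisy)
```

By the post-processing property (Theorem 2.7), anything computed from the noisy table alone, such as clipping negative counts or fitting a generative model to it, retains the same $(\epsilon, \delta)$ guarantee.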