Bounding the Excess Risk for Linear Models Trained on Marginal-Preserving, Differentially-Private, Synthetic Data
Yvonne Zhou, Mingyu Liang, Ivan Brugere, Dana Dachman-Soled, Danial Dervovic, Antigoni Polychroniadou, Min Wu
TL;DR
The paper addresses the privacy risk of ML models by applying differential privacy at the preprocessing stage, through marginal-preserving synthetic data. It gives an end-to-end analysis, deriving upper and lower bounds on the excess empirical risk of linear models trained on synthetic data that approximately preserves low-order marginals, with tighter guarantees for logistic regression. A DP mechanism $\mathrm{Gen}_{d,\sigma}$ is proposed to generate such synthetic data, linking the privacy parameters to utility via the marginal distance $\nu$, and a matching lower bound shows near-optimality in certain regimes. Empirically, training on marginal-preserving DP data generated by AIM for six public datasets yields minimal utility loss (often below 1-2%) and small excess risk. Because the privacy cost is paid once, when the synthetic data is generated, downstream training consumes no additional privacy budget. The work advances practical DP ML by showing that preserving marginals can sustain performance while providing scalable, reusable, private data for downstream tasks.
Abstract
The growing use of machine learning (ML) has raised concerns that an ML model may reveal private information about an individual who has contributed to the training dataset. To prevent leakage of sensitive data, we consider using differentially private (DP) synthetic training data instead of real training data to train an ML model. A key desirable property of synthetic data is its ability to preserve the low-order marginals of the original distribution. Our main contribution comprises novel upper and lower bounds on the excess empirical risk of linear models trained on such synthetic data, for continuous and Lipschitz loss functions. We perform extensive experimentation alongside our theoretical results.
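The two quantities the bounds relate, a distance between low-order marginals and the excess empirical risk, can be made concrete with a small sketch. The definitions below are assumptions for illustration, not the paper's exact ones: the marginal distance is taken as the largest total-variation distance over all k-way marginals, and the excess empirical risk as the logistic loss on the real data of a model fitted to synthetic data minus that of a model fitted to real data. All function names (`marginal`, `marginal_distance`, `logistic_risk`, `excess_risk`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def marginal(rows, cols):
    """Empirical joint distribution of the attributes indexed by `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return {key: cnt / n for key, cnt in counts.items()}


def marginal_distance(real, synth, k=2):
    """Largest total-variation distance over all k-way marginals.

    Assumed definition for illustration; the paper's distance may differ.
    """
    d = len(real[0])
    worst = 0.0
    for cols in combinations(range(d), k):
        p, q = marginal(real, cols), marginal(synth, cols)
        keys = set(p) | set(q)
        tv = 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)
        worst = max(worst, tv)
    return worst


def logistic_risk(theta, X, y):
    """Mean logistic loss of the linear model `theta`; labels are in {-1, +1}."""
    total = 0.0
    for x, label in zip(X, y):
        margin = label * sum(t * xi for t, xi in zip(theta, x))
        total += math.log1p(math.exp(-margin))
    return total / len(X)


def excess_risk(theta_synth, theta_real, X, y):
    """Risk gap on the real data between the two fitted models."""
    return logistic_risk(theta_synth, X, y) - logistic_risk(theta_real, X, y)


# Toy categorical records with three binary attributes.
real = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
synth = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 0, 0)]
nu_hat = marginal_distance(real, synth, k=2)  # 0.25 for this toy data

# Toy real dataset and two hypothetical fitted parameter vectors.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [1, -1, 1]
gap = excess_risk((0.5, -0.5), (1.0, -1.0), X, y)
```

In this toy example only one synthetic record differs from the real data, and the largest discrepancy among the three 2-way marginals is 0.25. A synthetic dataset that matched every 2-way marginal exactly would give a distance of 0 under this definition.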
