Does Training with Synthetic Data Truly Protect Privacy?
Yunpeng Zhao, Jie Zhang
TL;DR
This work critically evaluates whether training with synthetic data protects privacy by auditing four empirical paradigms (coreset selection, dataset distillation, data-free knowledge distillation, and diffusion-model-generated data) against membership inference (MI) attacks. Using a worst-case, LiRA-inspired evaluation framework on CIFAR-10 and comparing against DP-SGD, the authors show that none of the empirical methods surpasses differential privacy in the privacy-utility-efficiency tradeoff. Some leak substantial information about vulnerable samples, and others leak through visual memorization even when MI signals are low. The study finds that initialization choices, memorization dynamics, and visual leakage can undermine privacy protections even when synthetic data appear dissimilar to the private data. The results call for rigorous evaluation standards and reproducibility, and argue that practical privacy guarantees require formal DP methods rather than empirical defenses alone.
Abstract
As synthetic data becomes increasingly popular in machine learning tasks, numerous methods, all without formal differential privacy guarantees, use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.
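To make the membership inference auditing mentioned above more concrete, here is a minimal sketch of a LiRA-style per-example test and a low-false-positive-rate metric. It is not the authors' code. The function names (`lira_score`, `tpr_at_fpr`), the Gaussian fit on logit-scaled confidences, and the use of true positive rate at a fixed false positive rate are assumptions based on how likelihood-ratio attacks are commonly described. The shadow-model confidences are assumed to be computed elsewhere.

```python
import numpy as np

def logit_scale(p, eps=1e-6):
    """Map a correct-class probability to logit space, log(p / (1 - p))."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def lira_score(target_conf, in_confs, out_confs):
    """Log-likelihood ratio of 'member' vs. 'non-member' for one example.

    in_confs / out_confs are confidences from hypothetical shadow models
    trained with / without the example (assumed to be given).
    """
    x = logit_scale(target_conf)
    phi_in, phi_out = logit_scale(in_confs), logit_scale(out_confs)
    # A small constant keeps the standard deviation strictly positive.
    std_in, std_out = phi_in.std() + 1e-8, phi_out.std() + 1e-8
    return float(gaussian_logpdf(x, phi_in.mean(), std_in)
                 - gaussian_logpdf(x, phi_out.mean(), std_out))

def tpr_at_fpr(scores, is_member, target_fpr=0.001):
    """True positive rate when at most target_fpr of non-members are flagged."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    non_member = np.sort(scores[~is_member])[::-1]  # descending order
    # Allow k false positives; the threshold is the (k+1)-th largest score.
    k = min(int(np.floor(target_fpr * non_member.size)), non_member.size - 1)
    threshold = non_member[k]
    return float(np.mean(scores[is_member] > threshold))
```

A higher `lira_score` indicates stronger evidence that the example was in the training set. Reporting `tpr_at_fpr` at a small `target_fpr` focuses the audit on the most vulnerable samples instead of average-case accuracy, which is the worst-case perspective the summary describes.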
