Does Training with Synthetic Data Truly Protect Privacy?

Yunpeng Zhao, Jie Zhang

TL;DR

This work critically evaluates whether training with synthetic data protects privacy by auditing four empirical paradigms (coreset selection, dataset distillation, data-free knowledge distillation, and diffusion-model-generated data) against membership inference (MI) attacks. Using a worst-case, LiRA-inspired evaluation framework on CIFAR-10 and comparing against DPSGD, the authors show that none of the empirical methods outperforms the differentially private baseline (DPSGD) in the privacy-utility-efficiency tradeoff, and that some leak heavily on the most vulnerable samples or through visual memorization even when MI signals are low. The study finds that initialization choices, memorization dynamics, and visual leakage can undermine privacy protections, even when the synthetic data appear dissimilar to the private data. The results call for rigorous evaluation standards and reproducibility, and argue that practical privacy guarantees require formal DP methods rather than empirical defenses alone.

Abstract

As synthetic data becomes increasingly popular in machine learning tasks, numerous methods without formal differential privacy guarantees use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms: coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. We caution that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

Paper Structure

This paper contains 41 sections, 4 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: A rigorous evaluation of privacy leakage in models trained with synthetic data. We compare the privacy-utility tradeoff and efficiency of four training paradigms, namely coreset selection, dataset distillation (DD), data-free knowledge distillation (DFKD), and synthetic data generated from diffusion models, against DPSGD.
  • Figure 2: We evaluate the privacy leakage of private training data in the worst-case scenario for each training paradigm, only interacting with the final model trained on synthetic data.
  • Figure 3: Failing to report privacy leakage on the most vulnerable data provides a false sense of privacy. We investigate three different defenses: one based on coreset selection and two based on dataset distillation.
  • Figure 4: ML models tend to strongly memorize the most vulnerable data. We demonstrate this by presenting the loss distribution for both members and non-members, comparing average-case data with worst-case data.
  • Figure 5: Coreset selection does not guarantee privacy protection: both random selection and forgetting-based selection result in significant privacy leakage. The TPR at 0.1% FPR is 72.94% for forgetting and 38.70% for random selection.
  • ...and 8 more figures
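
The Figure 5 caption reports leakage as the true-positive rate (TPR) at a fixed 0.1% false-positive rate (FPR). The block below is a minimal sketch of how such a number can be computed from per-sample attack scores. It is not the authors' evaluation code: the function name `tpr_at_fpr`, the convention that a higher score means "more likely a member", and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence


def tpr_at_fpr(member_scores: Sequence[float],
               nonmember_scores: Sequence[float],
               target_fpr: float = 0.001) -> float:
    """TPR of a threshold attack whose FPR does not exceed target_fpr.

    Assumed convention (hypothetical): a higher score means the attack
    is more confident that the sample was in the training set.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")

    # Rank non-member scores from most to least member-like.
    ranked = sorted(nonmember_scores, reverse=True)

    # Number of non-members we may wrongly flag within the FPR budget.
    # The small epsilon guards against float round-off in the product.
    allowed_fp = int(target_fpr * len(ranked) + 1e-9)
    if allowed_fp >= len(ranked):
        return 1.0  # the budget allows flagging every sample

    # Flag a sample only if it scores strictly above this value, so at
    # most `allowed_fp` non-members are flagged.
    threshold = ranked[allowed_fp]
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)


# Toy scores, invented for illustration only.
members = [0.9, 0.8, 0.4, 0.2]
nonmembers = [0.7, 0.5, 0.3, 0.1]
print(tpr_at_fpr(members, nonmembers, target_fpr=0.25))  # 0.5
```

With only four non-members, the default 0.1% budget allows zero false positives, so the threshold is the highest non-member score. A meaningful estimate at 0.1% FPR needs on the order of thousands of non-member scores.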