Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model

Zinan Lin, Tadas Baltrusaitis, Wenyu Wang, Sergey Yekhanin

TL;DR

This work broadens DP synthetic data generation by generalizing Private Evolution (PE) from foundation-model inference APIs to non-neural simulators, creating Sim-PE. The approach preserves the core RANDOM_API and VARIATION_API interface, enabling simulator backends (and optionally hybrids with foundation models) to generate DP data without training. Across MNIST, CelebA, and CIFAR10, Sim-PE with simulator backends yields up to 3x gains in downstream accuracy and up to 80% FID reduction, while offering substantial efficiency improvements over model-based counterparts. The results demonstrate that simulators can significantly expand the practical reach of DP data synthesis, and hybrid schemes that combine simulators with foundation models can further boost performance; the authors release code in the Private Evolution Python library.

Abstract

Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art (SoTA) models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow APIs beyond foundation models. In particular, we demonstrate that many SoTA data synthesizers that do not rely on neural networks--such as computer graphics-based image generators, which we refer to as simulators--can be effectively integrated into PE. This insight significantly broadens PE's applicability and unlocks the potential of powerful simulators for DP data synthesis. We explore this approach, named Sim-PE, in the context of image synthesis. Across four diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x, reducing FID by up to 80%, and offering much greater efficiency. We also show that simulators and foundation models can be easily leveraged together within PE to achieve further improvements. The code is open-sourced in the Private Evolution Python library: https://github.com/microsoft/DPSDA.
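
To make the RANDOM_API / VARIATION_API interface concrete, the following is a minimal Python sketch of a PE-style loop driven by a toy simulator. It is not the authors' implementation and does not use the Private Evolution library. The one-dimensional "simulator", the function names (random_api, variation_api, dp_nn_histogram, sim_pe), and all parameter values are hypothetical. The loop structure (noisy nearest-neighbor voting followed by resampling and variation) is an assumption based on the general PE recipe, and the noise scale sigma is not calibrated to any particular privacy budget.

```python
import random


def random_api(n, rng):
    """RANDOM_API (hypothetical): draw n samples from a toy simulator
    by choosing its single parameter uniformly at random."""
    return [rng.uniform(0.0, 10.0) for _ in range(n)]


def variation_api(samples, rng, scale=0.5):
    """VARIATION_API (hypothetical): produce a slight variation of each
    sample by perturbing the simulator parameter."""
    return [s + rng.gauss(0.0, scale) for s in samples]


def dp_nn_histogram(private, synthetic, sigma, rng):
    """Each private sample votes for its nearest synthetic sample.
    Gaussian noise is added to the vote counts; sigma is an assumed
    value, not derived from a target epsilon."""
    hist = [0.0] * len(synthetic)
    for p in private:
        nearest = min(range(len(synthetic)), key=lambda i: abs(synthetic[i] - p))
        hist[nearest] += 1.0
    # Clip negative noisy counts to zero so they can serve as weights.
    return [max(h + rng.gauss(0.0, sigma), 0.0) for h in hist]


def sim_pe(private, n_syn=50, iterations=5, sigma=1.0, seed=0):
    """Toy PE-style loop: initialize with RANDOM_API, then repeatedly
    select well-supported samples and vary them with VARIATION_API."""
    rng = random.Random(seed)
    synthetic = random_api(n_syn, rng)
    for _ in range(iterations):
        weights = dp_nn_histogram(private, synthetic, sigma, rng)
        if sum(weights) == 0:
            # Fall back to uniform selection if all noisy counts were clipped.
            weights = [1.0] * n_syn
        selected = rng.choices(synthetic, weights=weights, k=n_syn)
        synthetic = variation_api(selected, rng)
    return synthetic


if __name__ == "__main__":
    data_rng = random.Random(123)
    private_data = [data_rng.gauss(7.0, 0.3) for _ in range(200)]
    result = sim_pe(private_data)
    print(len(result), sum(result) / len(result))
```

The private data is touched only inside dp_nn_histogram, which is why the generator behind the two API functions can be swapped (foundation model, simulator, or a mix) without changing the rest of the loop.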

Paper Structure

This paper contains 36 sections, 4 equations, 27 figures, 10 tables, 2 algorithms.

Figures (27)

  • Figure 1: Unlike prior PE work that relies solely on foundation models, we show that PE is also compatible with non-neural-network data synthesis tools, which we call simulators. This greatly broadens PE's applicability and enables SoTA simulators for DP data synthesis.
  • Figure 2: Real (private) images
  • Figure 3: Simulator-generated images
  • Figure 4: Sim-PE images ($\epsilon=10$)
  • Figure 6: Real (private) images
  • ...and 22 more figures