Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Viktor Schlegel; Anil A Bharath; Zilong Zhao; Kevin Yee

TL;DR

This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text.

Abstract

Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for downstream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualize formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($\varepsilon \leq 4$), revealing a substantial gap between results reported on general-domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields, so that this technology can deliver on its considerable potential.
