SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy

Md Mahadi Hasan Nahid, Sadid Bin Hasan

TL;DR

This work tackles privacy in data-driven ML by introducing SafeSynthDP, a training-free pipeline that generates DP-enhanced synthetic data via Large Language Models. By combining in-context demonstrations with Laplace or Gaussian noise governed by a privacy budget $\epsilon$, the approach aims to preserve key data statistics while protecting individual privacy. Empirical results on AGNews show that DP-augmented synthetic data can support a range of ML models and LLM in-context learning, though there is a measurable trade-off between privacy and utility, especially for complex models. The study establishes a foundational methodology for privacy-preserving synthetic data using LLMs and outlines practical directions to improve utility without compromising privacy in sensitive domains.

Abstract

Machine learning (ML) models frequently rely on training data that may include sensitive or personal information, raising substantial privacy concerns. Legislative frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the development of strategies that preserve privacy while maintaining the utility of data. In this paper, we investigate the capability of Large Language Models (LLMs) to generate synthetic datasets integrated with Differential Privacy (DP) mechanisms, thereby enabling data-driven research and model training without direct exposure of sensitive information. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data. To substantiate privacy guarantees, we assess the resilience of the generated synthetic data to membership inference attacks and related threats. The experimental results demonstrate that integrating DP within LLM-driven synthetic data generation offers a viable balance between privacy protection and data utility. This study provides a foundational methodology and insight into the privacy-preserving capabilities of LLMs, paving the way for compliant and effective ML research and applications.
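
To make the noise-injection step concrete, the sketch below shows the two textbook mechanisms the abstract names, applied to a list of numeric values. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `sensitivity` argument, and the `delta` parameter for the Gaussian case are hypothetical, and the abstract does not say which quantities the noise is added to. The Laplace scale is the standard $b = \Delta / \epsilon$, and the Gaussian standard deviation uses the classic calibration $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \epsilon$, which holds for $0 < \epsilon < 1$.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Add Laplace noise with scale b = sensitivity / epsilon to each value.

    `sensitivity` is the assumed L1 sensitivity of the released quantity.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classic Gaussian-mechanism calibration (valid for 0 < epsilon < 1).

    `sensitivity` is the assumed L2 sensitivity of the released quantity.
    """
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("this calibration needs 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(values, sensitivity, epsilon, delta, rng):
    """Add zero-mean Gaussian noise calibrated for (epsilon, delta)-DP."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the demo is repeatable
    stats = [120.0, 85.0, 42.0]  # hypothetical per-class counts
    print(laplace_mechanism(stats, sensitivity=1.0, epsilon=0.5, rng=rng))
    print(gaussian_mechanism(stats, sensitivity=1.0, epsilon=0.5, delta=1e-5, rng=rng))
```

In both mechanisms a smaller $\epsilon$ produces a larger noise scale, which is the source of the privacy-utility trade-off the paper reports.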

Paper Structure

This paper contains 32 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Workflow for Generating Privacy-Preserving Synthetic Data (SafeSynthDP). This diagram illustrates the process from selecting in-context examples from the original dataset, through the generation of initial synthetic data using the gpt-4o-mini LLM, to enhancing privacy through the addition of Laplace or Gaussian noise and adjusting the privacy level via the $\epsilon$ parameter, culminating in the evaluation of the privacy-enhanced synthetic data.
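
The stages named in the Figure 1 caption can be sketched as a small pipeline skeleton. This is a hedged illustration, not the paper's code: every function name, the prompt wording, and the choice to add Laplace noise to token counts of the generated text are assumptions made for the example, since the caption does not specify what the noise is applied to. The LLM is passed in as a plain callable (`llm`) so the sketch runs without any API; in the paper that role is played by gpt-4o-mini. The `sensitivity=1.0` default is a placeholder, and the sketch makes no formal privacy claim.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def select_in_context_examples(dataset: List[Example], label: str,
                               k: int, rng: random.Random) -> List[Example]:
    # Stage 1: sample up to k demonstrations of the requested class.
    pool = [ex for ex in dataset if ex[1] == label]
    return rng.sample(pool, min(k, len(pool)))


def build_prompt(examples: List[Example], label: str) -> str:
    # Hypothetical prompt template; the paper's wording is not given here.
    lines = [f"Write one new news headline in the category '{label}'.", "Examples:"]
    lines += [f"- {text}" for text, _ in examples]
    return "\n".join(lines)


def laplace_noise(scale: float, rng: random.Random) -> float:
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_token_counts(text: str, epsilon: float, rng: random.Random,
                       sensitivity: float = 1.0) -> Dict[str, float]:
    # Stage 3 (assumed form): perturb token counts with scale sensitivity / epsilon.
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    counts = Counter(text.lower().split())
    return {tok: c + laplace_noise(scale, rng) for tok, c in counts.items()}


def run_pipeline(dataset: List[Example], label: str, k: int, epsilon: float,
                 llm: Callable[[str], str], rng: random.Random):
    examples = select_in_context_examples(dataset, label, k, rng)
    prompt = build_prompt(examples, label)
    synthetic_text = llm(prompt)  # Stage 2: initial synthetic sample
    features = noisy_token_counts(synthetic_text, epsilon, rng)
    return prompt, synthetic_text, features


if __name__ == "__main__":
    toy_data = [("Stocks slide on rate fears", "Business"),
                ("Oil prices climb again", "Business"),
                ("Striker signs new deal", "Sports")]
    stub_llm = lambda prompt: "Markets rally as tech shares climb"  # stand-in for a real LLM
    print(run_pipeline(toy_data, "Business", k=2, epsilon=1.0,
                       llm=stub_llm, rng=random.Random(0)))
```

The final evaluation stage in the caption (training downstream models on the output and comparing against models trained on the original data) is left out of the sketch.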