Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion

Tianyuan Zou, Yang Liu, Peng Li, Yufei Xiong, Jianqing Zhang, Jingjing Liu, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang

TL;DR

WASP addresses the challenge of generating high-quality, differentially private synthetic data from limited private samples by fusing multiple pre-trained language models (PLMs) through a weighted, private Top-$Q$ voting mechanism and cross-PLM contrastive in-context learning. The method iteratively expands a DP synthetic dataset using adaptive PLM weights and contrastive prompts, while ensuring $(\epsilon,\delta)$-DP via Gaussian noise on voting histograms. Empirical results across six NLP tasks show WASP consistently outperforms baselines, demonstrates PLM-agnostic robustness, and scales to federated data settings, highlighting its practicality for privacy-preserving data synthesis in real-world deployments.

Abstract

Substantial quantity and high quality are the golden rules of building a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods that rely on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise, and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-$Q$ voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
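The private Top-$Q$ voting and PLM weighting summarized in the TL;DR and abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal one-vote-per-neighbor rule, the Euclidean distance, and the clipping of negative noisy votes are all assumptions made for illustration, and `sigma` is taken as given rather than calibrated to a target $(\epsilon,\delta)$.

```python
import math
import random


def noisy_top_q_votes(private_emb, synthetic_emb, q, sigma, rng):
    """Return one noisy vote count per synthetic sample (hypothetical helper)."""
    votes = [0.0] * len(synthetic_emb)
    for p in private_emb:
        # Rank synthetic samples by Euclidean distance to this private sample.
        ranked = sorted(
            range(len(synthetic_emb)),
            key=lambda j: math.dist(p, synthetic_emb[j]),
        )
        # Assumption: each of the q nearest synthetic samples gets one equal vote.
        for j in ranked[:q]:
            votes[j] += 1.0
    # Gaussian noise is added to the voting histogram; calibrating sigma to a
    # privacy budget is outside the scope of this sketch.
    return [v + rng.gauss(0.0, sigma) for v in votes]


def plm_weights(noisy_votes, plm_ids, num_plms):
    """Aggregate noisy votes per generating PLM and normalize (hypothetical helper)."""
    totals = [0.0] * num_plms
    for v, k in zip(noisy_votes, plm_ids):
        totals[k] += max(v, 0.0)  # assumption: clip negative values caused by noise
    total = sum(totals)
    if total == 0.0:
        return [1.0 / num_plms] * num_plms  # fall back to uniform weights
    return [t / total for t in totals]


# Toy usage: two private embeddings, three synthetic samples from two PLMs.
private = [(0.0, 0.0), (1.0, 1.0)]
synthetic = [(0.0, 0.1), (1.0, 1.1), (5.0, 5.0)]
votes = noisy_top_q_votes(private, synthetic, q=2, sigma=1.0, rng=random.Random(0))
weights = plm_weights(votes, plm_ids=[0, 1, 1], num_plms=2)
```

Because the PLM weights are computed only from the already-noised histogram, they are post-processing of a DP output in this sketch and consume no additional privacy budget.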

Paper Structure

This paper contains 28 sections, 5 theorems, 9 equations, 7 figures, 11 tables, 2 algorithms.

Key Result

Theorem 4.1

WASP (the paper's full algorithm) satisfies $(\epsilon,\delta)$-DP.
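
For reference, the standard definition behind this statement is as follows. The symbols $\mathcal{M}$, $D$, $D'$, and $S$ are notation assumed here rather than taken from the paper. A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon,\delta)$-DP if, for every pair of adjacent datasets $D$ and $D'$ (differing in one sample) and every set of outputs $S$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
$$

Smaller $\epsilon$ and $\delta$ mean that the output distribution changes less when a single private sample is added or removed.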

Figures (7)

  • Figure 1: $(a)$ Comparison of the similarity of the synthetic dataset to the real private dataset (measured by FID (Heusel et al., 2017)) and STM performance (numbers in parentheses) of Aug-PE (Xie et al., 2024) (dotted lines) and our refinement (dashed lines) under $(4.0, 1\times10^{-5})$-DP with the IMDb dataset. Lower FID indicates higher similarity. $(b)$ Results of Aug-PE using $100$ private samples under $(4.0, 1\times10^{-5})$-DP.
  • Figure 2: Overview of WASP framework.
  • Figure 3: Evaluation of downstream STM accuracy on the Yelp-Rating dataset with $K=1,2,3$ closed-source PLMs and $L=1$ under the $(4.0,1\times10^{-5})$-DP setting. In (b), results on the diagonal are with $K=1$ and all others are with $K=2$.
  • Figure 4: Comparison of downstream STM accuracy using different numbers of private samples ($M$) from the training sets of the IMDb and Yelp-Rating datasets, with $6$ open-source PLMs and $L=1$ under $(4.0,1\times10^{-5})$-DP.
  • Figure 5: Comparison of the resemblance of the synthetic dataset to the real private dataset (FID) for Aug-PE and our proposed WASP on the movie review semantic analysis task with the IMDb dataset.
  • ...and 2 more figures
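
The FID reported in Figures 1 and 5 is commonly defined as the Fréchet distance between two Gaussians fitted to embeddings of the real and synthetic samples. The notation below is assumed for illustration and is not taken from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the real (private) sample embeddings, and $\mu_s, \Sigma_s$ those of the synthetic sample embeddings.

$$
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2 \left( \Sigma_r \Sigma_s \right)^{1/2} \right)
$$

The value is zero when the two fitted Gaussians coincide and grows as their means or covariances diverge, which is why lower FID indicates higher similarity.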

Theorems & Definitions (5)

  • Theorem 4.1
  • Theorem D.1
  • Theorem D.2
  • Lemma D.3
  • Theorem D.4