EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors

Amin Banayeeanzade, Qingchuan Yang, Deqing Fu, Spencer Hong, Erin Babinsky, Alfy Samuel, Anoop Kumar, Robin Jia, Sai Praneeth Karimireddy

TL;DR

EPSVec is a lightweight, differentially private alternative to existing private text generation methods. It steers LLM generation using *dataset vectors*, directions in activation space that capture the distributional gap between private data and public priors. It outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.

Abstract

High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a differentially-private lightweight alternative that steers LLM generation using *dataset vectors*--directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.
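To make the "extract once, sanitize, then decode normally" idea concrete, here is a minimal NumPy sketch. It is not the paper's algorithm, which is not reproduced in this summary. It assumes a generic mean-difference steering vector: per-sample activations are clipped, averaged, perturbed with Gaussian noise, and compared against a public mean. The function names (`clip_rows`, `noisy_dataset_vector`, `steer`), the clipping bound `max_norm`, the noise scale `sigma`, and the `strength` multiplier are all hypothetical, and `sigma` is left as a free parameter rather than calibrated to any $(\varepsilon,\delta)$.

```python
import numpy as np

def clip_rows(x, max_norm):
    """Rescale each row so that its L2 norm is at most max_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return x * scale

def noisy_dataset_vector(private_acts, public_acts, max_norm, sigma, rng):
    """Assumed construction: noisy mean of clipped private activations
    minus the mean of public activations (one layer)."""
    private_mean = clip_rows(private_acts, max_norm).mean(axis=0)
    # Noise touches only the statistic computed from private data.
    noise = rng.normal(0.0, sigma, size=private_mean.shape)
    return (private_mean + noise) - public_acts.mean(axis=0)

def steer(hidden, vector, strength=1.0):
    """Add the vector to the hidden state at every token position."""
    return hidden + strength * vector

# Toy activations standing in for one layer of an LLM (hidden size 16).
rng = np.random.default_rng(0)
private_acts = rng.normal(1.0, 1.0, size=(64, 16))
public_acts = rng.normal(0.0, 1.0, size=(256, 16))

vec = noisy_dataset_vector(private_acts, public_acts,
                           max_norm=10.0, sigma=0.05, rng=rng)
hidden = np.zeros((5, 16))          # 5 token positions
steered = steer(hidden, vec, strength=2.0)
```

Once `vec` has been released, `steer` can be applied to any number of generations without touching the private data again, which is the sense in which the abstract describes the privacy budget as decoupled from generation.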

Paper Structure

This paper contains 56 sections, 5 theorems, 17 equations, 9 figures, 7 tables, 2 algorithms.

Key Result

Theorem 4.1

For all $\varepsilon>0$ and $\delta\in(0,1)$, consider the dataset vectors $\{v_\ell\}_{\ell=1}^L$ released by Algorithm alg:private-vector. If, for each layer $\ell$, the noise scale satisfies [...], then extracting the dataset vector $v_\ell$ is $(\varepsilon,\delta)$-DP, and Algorithm alg:private-vector is $(L\varepsilon, L\delta)$-DP by basic composition.
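The composition step of the theorem is plain arithmetic, sketched below. The helper names `basic_composition` and `split_budget` are hypothetical, and the sketch covers only the bookkeeping across layers. It does not calibrate the per-layer noise scale, since that condition is not reproduced here.

```python
def basic_composition(eps_per_layer, delta_per_layer, num_layers):
    """Total (epsilon, delta) after releasing one vector per layer,
    each being (eps_per_layer, delta_per_layer)-DP."""
    return num_layers * eps_per_layer, num_layers * delta_per_layer

def split_budget(eps_total, delta_total, num_layers):
    """Per-layer budget whose basic composition equals the total."""
    return eps_total / num_layers, delta_total / num_layers

# Example: a total budget of epsilon = 5 spread evenly over 4 layers.
eps_layer, delta_layer = split_budget(5.0, 1e-5, num_layers=4)
eps_total, delta_total = basic_composition(eps_layer, delta_layer, num_layers=4)
```

Under basic composition the per-layer budget shrinks linearly with the number of steered layers $L$, so steering more layers means more noise per vector for the same total guarantee.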

Figures (9)

  • Figure 1: Overview of our method. We extract dataset vectors by distilling the private dataset into a compressed vector. Protected with enough noise, this vector can be injected into the hidden states of the LLM at inference time to efficiently generate any desired amount of synthetic data while ensuring differential privacy.
  • Figure 2: First two principal components of BioRxiv abstract embeddings. Model-generated points represent biology paper abstracts generated by zero-shot prompting of Llama-3.1-8B-Instruct, while other points show real paper abstracts.
  • Figure 3: Comparing human-written IMDb reviews and samples generated via zero-shot prompting with Llama-3.1-8B IT and PT models. (Left) IT model shows lower lexical diversity measured as the number of unique opening $3$-grams. (Right-Top) LLM-as-a-judge assigns higher quality scores to IT-generated reviews than to real reviews, reflecting excessive fluency and grammatical soundness. (Right-Bottom) BERT classifier trained on real reviews and tested on synthetic data (TRTS) achieves near-perfect accuracy on IT samples, suggesting these samples are overly simplistic.
  • Figure 4: Varying number of private samples consumed for vector construction in Yelp and $\varepsilon=5$. (Left) MAUVE increase is observed even when using a small number of private data. (Right) Synthetic text quality remains similar to private text quality for varying amount of private data used.
  • Figure 5: Runtime and sample efficiency for all methods on one A100 GPU with 80GB VRAM. (Left) Amortized runtime and number of required real data for generating $2$K synthetic samples on Yelp dataset, privacy budget $\varepsilon=5$. (Right) Maximum length of synthetic sample allowed given privacy budget.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Definition 3.1 (citing dwork2006our)
  • Theorem 4.1: Privacy Guarantees of Dataset Vectors
  • Theorem 4.2: Privacy guarantee with subsampling
  • Lemma 1.1
  • Lemma 1.2
  • Lemma 1.3