Private Text Generation by Seeding Large Language Model Prompts

Supriya Nagesh, Justin Y. Chen, Nina Mishra, Tal Wagner

TL;DR

This work tackles the challenge of sharing private text data for machine learning by generating private synthetic text through prompts to a large language model, without exposing sensitive data. It introduces DP-KPS, which seeds LLM prompts with sequences of privatized keyphrases drawn from a differentially private kernel density estimate (DP KDE) over private embeddings, allocating privacy budgets $\varepsilon_{\mathrm{voc}}$ and $\varepsilon_{\mathrm{kde}}$ (with total $\varepsilon_{\mathrm{total}}=\varepsilon_{\mathrm{voc}}+\varepsilon_{\mathrm{kde}}$). The approach combines a privatized vocabulary, high-dimensional DP KDE sampling, and either independent or iterative keyphrase sequence generation, followed by LLM prompting and domain adaptation to produce DP-compliant synthetic texts. Empirical results on MIMIC medical notes and DBPedia-14 demonstrate that DP-KPS can preserve substantial predictive power under DP constraints and outperform certain prompt-based baselines while using far fewer LLM prompts; ablations show the benefits of domain adaptation and few-shot prompting. Overall, DP-KPS offers a practical pathway for privacy-preserving data sharing in regulated domains, enabling downstream ML while acknowledging risks such as hallucinations and distribution shifts that require careful validation and governance.
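
The additive budget is consistent with basic sequential composition of differential privacy, recalled here as a reminder rather than as the paper's own proof. The mechanism names $M_{\mathrm{voc}}$ and $M_{\mathrm{kde}}$ and the numeric values below are labels and numbers introduced only for this illustration. If $M_{\mathrm{voc}}$ is $\varepsilon_{\mathrm{voc}}$-DP and $M_{\mathrm{kde}}$ is $\varepsilon_{\mathrm{kde}}$-DP (even when $M_{\mathrm{kde}}$ takes the output of $M_{\mathrm{voc}}$ as an extra input), then releasing both outputs is $\varepsilon_{\mathrm{total}}$-DP with

$$\varepsilon_{\mathrm{total}} = \varepsilon_{\mathrm{voc}} + \varepsilon_{\mathrm{kde}}, \qquad \text{e.g. } \varepsilon_{\mathrm{voc}} = 1,\ \varepsilon_{\mathrm{kde}} = 3 \ \Rightarrow\ \varepsilon_{\mathrm{total}} = 4.$$

Because the LLM is accessed only through already privatized prompts, the prompting step acts as post-processing of DP outputs and, under that assumption, spends no additional privacy budget.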

Abstract

We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus while achieving requisite output diversity and maintaining differential privacy. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones. Our findings offer hope that institutions can reap ML insights by privately sharing data with simple prompts and little compute.

Paper Structure

This paper contains 24 sections, 3 theorems, 5 equations, 5 figures, 10 tables.

Key Result

Lemma 1

Let $\varepsilon>0$ and $\alpha\in(0,1)$ be such that $|X|\geq O(1/(\varepsilon\alpha^2))$. Then, one can construct in time $O(|X|dL/\alpha^2)$ an $\varepsilon$-DP data structure for $KDE^{[:L]}_X$, such that for every $y\in\mathbb R^{dL}$, the value of $KDE^{[:L]}_X(y)$ can be reported in time $O(d

Figures (5)

  • Figure 1: DP-KPS general approach. A privately generated sequence of keyphrases is used to seed an LLM prompt for generating each synthetic text document (e.g., a medical record).
  • Figure 2: DP-KPS detailed method overview. Color coding: red -- private data, purple -- differentially privatized data (safe to release), green -- public data, blue -- public pre-trained model. The pre-trained models are only used for inference, and on already privatized data.
  • Figure 3: DP-KPS method overview. We illustrate the process of generating privatized synthetic medical records. For simplicity, each medical record is represented by a single sentence. The two example private documents shown here contain terms related to heart conditions. As a result, the DP KDE has a higher concentration near heart-related words, so a term such as 'beta blocker' is likely to be sampled for the synthetic keyphrase.
  • Figure 4: Classification performance of data generated by DP-KPS (ours - star marker), AugPE+$\widetilde{\mathcal{V}}$ (dotted line) and AugPE (solid line) with varying privacy and prompt budget.
  • Figure 5: t-SNE plot of the BioBERT embeddings of real MIMIC data, synthetic data (DP-KPS output w/o DA) and synthetic data with few-shot prompting.
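
The sampling step that Figure 3 illustrates can be sketched in a few lines of Python. This is a simplified stand-in, not the DP KDE data structure of Lemma 1: it assumes a small public vocabulary, computes exact Gaussian KDE scores at each vocabulary embedding, and privatizes them with the Laplace mechanism. The function names, the 2-D toy embeddings, the bandwidth, and the value of epsilon are all hypothetical choices made for this illustration.

```python
import numpy as np


def gaussian_kde_scores(private_emb, vocab_emb, bandwidth=1.0):
    """Exact (non-private) Gaussian KDE of the private points, evaluated at each vocabulary embedding."""
    # Pairwise squared distances, shape (num_vocab, num_private).
    d2 = ((vocab_emb[:, None, :] - private_emb[None, :, :]) ** 2).sum(axis=2)
    # Kernel values lie in (0, 1]; averaging keeps every score in (0, 1].
    return np.exp(-d2 / (2.0 * bandwidth**2)).mean(axis=1)


def noisy_kde_scores(private_emb, vocab_emb, epsilon, rng, bandwidth=1.0):
    """Laplace-mechanism release of all vocabulary scores (assumption: not the paper's mechanism)."""
    n = private_emb.shape[0]
    m = vocab_emb.shape[0]
    scores = gaussian_kde_scores(private_emb, vocab_emb, bandwidth)
    # Replacing one private point changes each score by at most 1/n,
    # so the L1 sensitivity of the m-dimensional score vector is at most m/n.
    scale = m / (n * epsilon)
    return scores + rng.laplace(0.0, scale, size=m)


def sample_keyphrase(noisy_scores, vocab, rng):
    """Sample a vocabulary term with probability proportional to its clipped noisy score.

    This only post-processes the noisy scores, so it uses no extra privacy budget.
    """
    weights = np.clip(noisy_scores, 0.0, None)
    if weights.sum() == 0.0:
        weights = np.ones_like(weights)  # fall back to uniform sampling
    probs = weights / weights.sum()
    return vocab[rng.choice(len(vocab), p=probs)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy private embeddings clustered near the (hypothetical) 'beta blocker' embedding.
    private_emb = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(500, 2))
    vocab = ["beta blocker", "fracture", "insulin"]
    vocab_emb = np.array([[1.0, 1.0], [-3.0, 2.0], [2.0, -3.0]])
    noisy = noisy_kde_scores(private_emb, vocab_emb, epsilon=1.0, rng=rng)
    print(sample_keyphrase(noisy, vocab, rng))
```

Adding noise to every vocabulary score costs noise proportional to the vocabulary size, which is one reason a dedicated DP KDE data structure is preferable at realistic scales; the sketch is meant only to make the "noisy density, then sample" idea concrete.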

Theorems & Definitions (4)

  • Lemma 1: Theorem 1.1 from wagner2023fast
  • Theorem A.1
  • proof
  • Lemma 2