Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation

Xinyu Tang, Richard Shin, Huseyin A. Inan, Andre Manoel, Fatemehsadat Mireshghallah, Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, Robert Sim

TL;DR

The paper tackles privacy risks in in-context learning by introducing a differential-privacy (DP) framework that privately generates synthetic few-shot demonstrations from a private dataset. It presents a PATE-like algorithm that aggregates generation signals from disjoint private subsets to produce DP-compliant prompts, enabling unlimited inference without additional privacy cost. Empirical results across AGNews, TREC, DBPedia, and MIT datasets show that 4-shot DP ICL can approach non-private performance at modest privacy budgets (e.g., $\epsilon=1$ on TREC yields 50.7% accuracy, near the non-private 50.6%), and even zero-shot generation by the model itself can yield strong baselines in some cases. The work demonstrates the practicality of privacy-preserving ICL for diverse NLP tasks and discusses future improvements in sampling and offline-online LM setups to further close the privacy-utility gap.

Abstract

We study the problem of in-context learning (ICL) with large language models (LLMs) on private datasets. This scenario poses privacy risks, as LLMs may leak or regurgitate the private examples demonstrated in the prompt. We propose a novel algorithm that generates synthetic few-shot demonstrations from the private dataset with formal differential privacy (DP) guarantees, and show empirically that it can achieve effective ICL. We conduct extensive experiments on standard benchmarks and compare our algorithm with non-private ICL and zero-shot solutions. Our results demonstrate that our algorithm can achieve competitive performance with strong privacy levels. These results open up new possibilities for ICL with privacy protection for a broad range of applications.

Paper Structure (27 sections, 5 theorems, 5 equations, 3 figures, 17 tables, 1 algorithm)

Key Result

Theorem 4.2

Algorithm 1 is $(\epsilon, \delta)$-differentially private.
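
For reference, the standard $(\epsilon, \delta)$-DP guarantee that Theorem 4.2 refers to can be written as below. The symbols $\mathcal{M}$ (the randomized algorithm), $D$ and $D'$ (neighboring datasets differing in one record), and $S$ (a set of outputs) are notation chosen here for illustration and may differ from the paper's Definition 2.1.

$$
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta \quad \text{for all neighboring } D, D' \text{ and all output sets } S.
$$

Smaller $\epsilon$ and $\delta$ mean the output distribution changes less when any single private record is added or removed.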

Figures (3)

  • Figure 1: Description of a potential privacy violation when few-shot demonstrations are drawn from a private dataset in an ICL framework for a healthcare application. A malicious adversary attempts a basic prompt injection attack and gains direct access to the demonstrations. Basic heuristics such as removing personally identifiable information (PII) may still leave information that is linkable to an individual (GDPR) if the adversary has auxiliary information (e.g., a unique patient with a particular disease or treatment), and so do not protect against privacy violations.
  • Figure 2: Our proposed framework for privacy-preserving ICL. Given a private dataset, we first generate synthetic few-shot samples with DP. The generated samples can then be used as ICL demonstrations to answer an unlimited number of queries without incurring any additional privacy cost.
  • Figure 3: Illustration of step 1 (DP few-shot generation) in our framework (Fig. 2). The example shows a synthetic demonstration generated token by token, with DP, for the topic school. The operations in Alg. 1 for one step of generation (the token College) are depicted step by step.
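
The single generation step that Figure 3 depicts can be illustrated with a minimal sketch. This is not the paper's Alg. 1; it only shows the general PATE-like idea of combining next-token distributions from disjoint private subsets and adding noise before choosing a token. The function name `noisy_next_token`, the use of Gaussian noise, the parameter `sigma`, and the optional `candidate_ids` restriction are all assumptions made for illustration, and calibrating `sigma` to a target $(\epsilon, \delta)$ is not shown.

```python
import numpy as np


def noisy_next_token(subset_probs, sigma, rng, candidate_ids=None):
    """Choose one token id from noisily aggregated next-token distributions.

    subset_probs: shape (M, V); row i is the next-token probability vector an
        LM produced when prompted with the i-th disjoint private subset.
    sigma: std of the Gaussian noise (hypothetical; privacy calibration is
        outside this sketch).
    rng: a numpy.random.Generator.
    candidate_ids: optional iterable of token ids to restrict the choice to.
    """
    probs = np.asarray(subset_probs, dtype=float)
    vocab_size = probs.shape[1]

    # Subsets are disjoint, so one private record influences a single row.
    # Two probability vectors differ by at most sqrt(2) in L2 norm, which
    # bounds how far the sum can move when that record changes.
    aggregate = probs.sum(axis=0)

    # Add independent Gaussian noise to every coordinate of the aggregate.
    noisy = aggregate + rng.normal(0.0, sigma, size=vocab_size)

    if candidate_ids is not None:
        # Exclude every token outside the candidate set from the argmax.
        mask = np.full(vocab_size, -np.inf)
        mask[list(candidate_ids)] = 0.0
        noisy = noisy + mask

    return int(np.argmax(noisy))


# Toy example: 3 disjoint subsets, vocabulary of 3 tokens.
toy_probs = [[0.1, 0.7, 0.2],
             [0.2, 0.6, 0.2],
             [0.3, 0.3, 0.4]]
token_id = noisy_next_token(toy_probs, sigma=1.0, rng=np.random.default_rng(0))
```

Repeating such a step token by token would yield one synthetic demonstration. Because the private data is touched only during generation, the finished demonstrations can then be reused across queries, which matches the "no additional privacy cost" property described for Figure 2.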

Theorems & Definitions (11)

  • Definition 2.1: Differential Privacy (DP) (Dwork et al., 2006)
  • Remark 4.1
  • Theorem 4.2
  • Proof: Proof Overview
  • Definition A.1
  • Theorem A.2
  • Theorem A.3 (Gopi et al., 2021)
  • Theorem A.4
  • Theorem A.5
  • Proof
  • ...and 1 more