Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning

Fengyu Gao, Ruida Zhou, Tianhao Wang, Cong Shen, Jing Yang

TL;DR

This work addresses privacy risks in in-context learning (ICL) by proposing AdaDPSyn, a data-adaptive differentially private method that generates synthetic demonstrations from private data for use in ICL. It introduces a clustering-aware technique, Precision-Focused Iterative Radius Reduction, that adaptively calibrates the noise added during DP aggregation. The privacy guarantee is established via Rényi DP and subsampling, and accuracy stays close to that of non-private baselines. Empirical results across text classification and information extraction tasks show that AdaDPSyn outperforms data-independent baselines, with notable gains at stringent privacy levels (e.g., AGNews at $\varepsilon=1$), and that it is robust to hyperparameter choices. The approach offers a practical solution for private ICL and suggests broader applicability to DP synthetic text generation with data-adaptive privacy budgeting.

Abstract

Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL). To mitigate the risk of LLMs potentially leaking private information contained in examples in the prompt, we introduce a novel data-adaptive differentially private algorithm called AdaDPSyn to generate synthetic examples from the private dataset and then use these synthetic examples to perform ICL. The objective of AdaDPSyn is to adaptively adjust the noise level in the data synthesis mechanism according to the inherent statistical properties of the data, thereby preserving high ICL accuracy while maintaining formal differential privacy guarantees. A key innovation in AdaDPSyn is the Precision-Focused Iterative Radius Reduction technique, which dynamically refines the aggregation radius (the scope of data grouping for noise addition) based on patterns observed in data clustering, thereby minimizing the amount of additive noise. We conduct extensive experiments on standard benchmarks and compare AdaDPSyn with the DP few-shot generation algorithm (Tang et al., 2023). The experiments demonstrate that AdaDPSyn not only outperforms DP few-shot generation, but also maintains high accuracy levels close to those of non-private baselines, providing an effective solution for ICL with privacy protection.
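The link between aggregation radius and noise can be illustrated with a minimal sketch. This is not the authors' AdaDPSyn algorithm: it assumes a fixed, publicly known `center`, whereas a real mechanism would have to locate the cluster privately. The function names `clip_to_ball` and `noisy_mean_in_ball` and the parameter `noise_multiplier` are hypothetical. The sketch only shows why a smaller radius permits less Gaussian noise: once every vector is projected into an L2 ball of radius `radius`, replacing one vector moves the mean by at most `2 * radius / n`.

```python
import math
import random


def clip_to_ball(vec, center, radius):
    """Project vec onto the L2 ball of the given radius around center."""
    offset = [v - c for v, c in zip(vec, center)]
    norm = math.sqrt(sum(o * o for o in offset))
    if norm <= radius:
        return list(vec)
    scale = radius / norm
    return [c + o * scale for c, o in zip(center, offset)]


def noisy_mean_in_ball(vectors, center, radius, noise_multiplier, rng):
    """Average the clipped vectors and add Gaussian noise scaled to the radius.

    Assumption: center is fixed and data-independent. Under replace-one
    neighboring datasets, the clipped mean then has L2 sensitivity
    2 * radius / n, so the noise standard deviation shrinks with the radius.
    """
    n = len(vectors)
    clipped = [clip_to_ball(v, center, radius) for v in vectors]
    mean = [sum(col) / n for col in zip(*clipped)]
    sigma = noise_multiplier * 2.0 * radius / n
    noisy = [m + rng.gauss(0.0, sigma) for m in mean]
    return noisy, sigma
```

For tightly clustered next-token probability vectors, a small radius clips almost nothing yet cuts the noise scale proportionally, which is the intuition behind calibrating the radius to the data rather than using a worst-case bound.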

Paper Structure

This paper contains 30 sections, 7 theorems, 4 equations, 1 figure, 31 tables, 5 algorithms.

Key Result

Theorem 1

The algorithm labeled alg:adaptive DP few-shot generation is $(\varepsilon, \delta)$-differentially private.
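For reference, the standard statement of $(\varepsilon, \delta)$-differential privacy that this guarantee refers to can be written as follows. The symbols $\mathcal{M}$ (the randomized mechanism), $D, D'$ (neighboring datasets differing in one record), and $S$ (any measurable set of outputs) are notation chosen here and may differ from the paper's.

```latex
\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all measurable } S .
\]
```

Smaller $\varepsilon$ and $\delta$ mean the output distribution changes less when a single private example is swapped, which is why $\varepsilon = 1$ counts as a stringent setting.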

Figures (1)

  • Figure 1: Two-Stage framework for privacy-preserving ICL. In Stage 1, DP-protected demonstrations are generated with AdaDPSyn, where we illustrate the process of generating the next token "City" following "New York". AdaDPSyn uses a novel clustering approach to dynamically aggregate next-token probabilities. The process iterates until $n_\text{shot}$ demonstrations are generated. In Stage 2, the generated DP-protected demonstrations are combined with a user query to construct the prompt. The constructed prompt is then sent to an LLM and the answer is returned to the user.
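Stage 2 of the framework, as described in the caption, amounts to concatenating the DP-protected demonstrations with the user query. A minimal sketch follows. The template (`Input:` / `Label:` fields and a leading instruction) and the name `build_icl_prompt` are hypothetical and are not taken from the paper.

```python
def build_icl_prompt(instruction, demonstrations, query):
    """Assemble an ICL prompt from synthetic demonstrations and a user query.

    demonstrations: list of (text, label) pairs, assumed to be the
    DP-protected outputs of Stage 1. The prompt format is illustrative only.
    """
    blocks = [instruction]
    for text, label in demonstrations:
        blocks.append(f"Input: {text}\nLabel: {label}")
    # The query is appended last with an empty label for the LLM to complete.
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)
```

Because only the synthetic demonstrations enter the prompt, the private examples themselves are never sent to the LLM at query time.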

Theorems & Definitions (14)

  • Definition 1: $(\varepsilon,\delta)$-Differential Privacy (Dwork et al., 2006)
  • Remark 1: Comparison with Tang et al. (2023)
  • Theorem 1
  • Proof: Proof Overview
  • Theorem 2 (Nissim et al., 2016)
  • Proof
  • Theorem 3: Restatement of the theorem labeled thm:DP
  • Definition 2: Rényi Divergence (Mironov, 2017)
  • Definition 3: Rényi Differential Privacy (Mironov, 2017)
  • Theorem 4: RDP Sequential Composition (Mironov, 2017)
  • ...and 4 more
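The Rényi DP items listed above combine into a simple accounting recipe, sketched below using standard results from Mironov (2017): a Gaussian mechanism with L2 sensitivity 1 and noise standard deviation $\sigma$ satisfies $(\alpha, \alpha / (2\sigma^2))$-RDP, RDP parameters add under sequential composition at a fixed order $\alpha$, and $(\alpha, \varepsilon)$-RDP implies $(\varepsilon + \log(1/\delta)/(\alpha - 1), \delta)$-DP. This sketch is not the paper's accountant. It ignores the privacy amplification from subsampling that the paper also uses, and the function names and the grid of orders are hypothetical choices.

```python
import math


def gaussian_rdp(alpha, noise_multiplier):
    """RDP epsilon at order alpha for a sensitivity-1 Gaussian mechanism."""
    return alpha / (2.0 * noise_multiplier ** 2)


def compose_rdp(per_step_eps, steps):
    """Sequential composition: RDP epsilons at the same order add up."""
    return per_step_eps * steps


def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert (alpha, rdp_eps)-RDP into an (epsilon, delta)-DP epsilon."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)


def best_epsilon(noise_multiplier, steps, delta, alphas=None):
    """Smallest epsilon over a grid of orders (the grid is an arbitrary choice)."""
    if alphas is None:
        alphas = [1.0 + k / 10.0 for k in range(1, 1000)]
    return min(
        rdp_to_dp(a, compose_rdp(gaussian_rdp(a, noise_multiplier), steps), delta)
        for a in alphas
    )
```

Calling `best_epsilon(noise_multiplier, steps, delta)` with `steps` set to the number of noisy aggregation steps gives an upper bound on $\varepsilon$ under these assumptions. Accounting for subsampling would tighten the bound.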