Differentially Private In-Context Learning with Nearest Neighbor Search

Antti Koskela, Tejas Kulkarni, Laith Zumot

TL;DR

This work tackles privacy risks in in-context learning by embedding a differentially private kNN retrieval step into the DP-ICL pipeline. It develops a fully adaptive privacy filter for $\delta$-approximate Rényi DP that tracks per-sample privacy usage during retrieval and aggregation, enabling rigorous accounting under composition. Empirical results on AGNews, TREC, DocVQA, and SQuAD demonstrate substantial privacy-utility improvements over DP baselines, validating the approach across classification and QA tasks with large language models. By integrating standard retrieval components with DP-ICL, the method enables privacy-preserving context curation in real-world ICL systems, with clear avenues for extending indexing strategies and accounting under different retrieval regimes. The budgets $\varepsilon_{\max}$ and $\delta_{\max}$ govern the privacy-utility trade-off, and the framework provides a principled path to scalable, privacy-aware ICL in practice.

Abstract

Differentially private in-context learning (DP-ICL) has recently become an active research topic due to the inherent privacy risks of in-context learning. However, existing approaches overlook a critical component of modern large language model (LLM) pipelines: the similarity search used to retrieve relevant context data. In this work, we introduce a DP framework for in-context learning that integrates nearest neighbor search of relevant examples in a privacy-aware manner. Our method outperforms existing baselines by a substantial margin across all evaluated benchmarks, achieving more favorable privacy-utility trade-offs. To achieve this, we employ nearest neighbor retrieval from a database of context data, combined with a privacy filter that tracks the cumulative privacy cost of selected samples to ensure adherence to a central differential privacy budget. Experimental results on text classification and document question answering show a clear advantage of the proposed method over existing baselines.
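The interplay between retrieval and the privacy filter described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the names `PerSampleFilter`, `retrieve`, `eps_step`, and `eps_max` are hypothetical, each retrieved sample is charged a fixed placeholder cost, similarity is a plain dot product over embeddings assumed to be normalized, and no noise is added, so the sketch shows only the bookkeeping and is not itself differentially private.

```python
import numpy as np


class PerSampleFilter:
    """Hypothetical helper: tracks cumulative privacy cost per context sample."""

    def __init__(self, n_samples, eps_max):
        self.spent = np.zeros(n_samples)  # cost paid so far by each sample
        self.eps_max = eps_max            # per-sample budget (assumed scalar)

    def active(self, eps_step):
        # A sample stays eligible only if paying eps_step keeps it within budget.
        return self.spent + eps_step <= self.eps_max + 1e-12

    def charge(self, idx, eps_step):
        # Only the samples actually selected pay for this query.
        self.spent[idx] += eps_step


def retrieve(query, embeddings, filt, k, eps_step):
    """Return indices of the k most similar samples that still have budget."""
    eligible = np.flatnonzero(filt.active(eps_step))
    if eligible.size == 0:
        return eligible  # every sample has exhausted its budget
    sims = embeddings[eligible] @ query          # dot-product similarity
    top = eligible[np.argsort(-sims)[:k]]        # k nearest among eligible
    filt.charge(top, eps_step)
    return top
```

Under these assumptions, repeated queries that keep hitting the same neighbors exhaust those neighbors' budgets, after which retrieval falls back to the next most similar samples that remain eligible. In a real pipeline the charged cost would come from the DP aggregation mechanism applied to the retrieved examples rather than from a fixed constant.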

Paper Structure

This paper contains 28 sections, 5 theorems, 27 equations, 4 figures, 2 tables, 3 algorithms.

Key Result

Theorem 1

Let $K \in \mathbb{Z}_+$ define the maximum number of compositions and let $\{\mathcal{M}_i\}_{i=1}^K$ be an adaptively chosen sequence of randomized mechanisms, where each $\mathcal{M}_i$ is $\delta_i$-approximate $(\alpha, \varepsilon_i(\alpha))$-RDP for some $\alpha \geq 1$. Let $\varepsilon_{\ma

Figures (4)

  • Figure 1: Mean test accuracies for 200 randomly sampled test samples. Left: AGNews text classification task with 4 classes, averaged over 5 experiments. Right: TREC text classification task with 6 classes, averaged over 5 experiments.
  • Figure 2: Left: A comparison of DP-KSA and DP-KSA-kNN on a 4-shot Q&A task on the DocVQA dataset using Gemini-1.5-flash-8B. Right: A comparison of DP-KSA and DP-KSA-kNN on a 4-shot Q&A task on the SQuAD dataset using the Llama 3.3-70B-It model. The averages are computed over individual metrics for 100 test queries. A higher value indicates a higher degree of similarity between the algorithm's final response and the ground truth. The proposed method (DP-KSA-kNN) outperforms the baseline (DP-KSA).
  • Figure 3: Distributions of the number of tokens in the 4-shot prompts (created using demonstration and test examples) with 20 shards, for two datasets. The prompts for the fed DocVQA dataset are longer because the images are verbose and therefore yield many more OCR-extracted tokens.
  • Figure 4: A comparison of DP-KSA and DP-KSA-kNN for average Q&A task metrics. Left: DocVQA dataset using Llama 3.3-70B-It. Right: SQuAD dataset using Gemini-1.5-flash-8B.

Theorems & Definitions (10)

  • Theorem 1: Privacy Filter for $\delta$-approximate Rényi Differential Privacy
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 6: Privacy amplification by Poisson subsampling for approximate RDP (wuprivacy)
  • Theorem 7: TopKwithPTR Privacy Guarantee
  • Theorem 8: FindBestK Privacy Guarantee
  • Theorem 9: Privacy Filter for $\delta$-Approximate Rényi Differential Privacy
  • proof