Differentially Private In-Context Learning with Nearest Neighbor Search
Antti Koskela, Tejas Kulkarni, Laith Zumot
TL;DR
This work addresses the privacy risks of in-context learning (ICL) by adding a differentially private kNN retrieval step to the DP-ICL pipeline. It develops a fully adaptive privacy filter for $\delta$-approximate Rényi DP that tracks each sample's privacy cost across retrieval and aggregation, which allows rigorous accounting under composition. Experiments on AGNews, TREC, DocVQA, and SQuAD show substantially better privacy-utility trade-offs than existing DP baselines on both classification and question answering tasks with large language models. Because the method combines standard retrieval components with DP-ICL, it can support privacy-preserving context selection in practical ICL systems, and it leaves room for other indexing strategies and for accounting tailored to different retrieval regimes. The budgets $\varepsilon_{\max}$ and $\delta_{\max}$ control the privacy-utility trade-off.
Abstract
Differentially private in-context learning (DP-ICL) has recently become an active research topic due to the inherent privacy risks of in-context learning. However, existing approaches overlook a critical component of modern large language model (LLM) pipelines: the similarity search used to retrieve relevant context data. In this work, we introduce a DP framework for in-context learning that integrates nearest neighbor search of relevant examples in a privacy-aware manner. Our method outperforms existing baselines by a substantial margin across all evaluated benchmarks, achieving more favorable privacy-utility trade-offs. To achieve this, we employ nearest neighbor retrieval from a database of context data, combined with a privacy filter that tracks the cumulative privacy cost of selected samples to ensure adherence to a central differential privacy budget. Experimental results on text classification and document question answering show a clear advantage of the proposed method over existing baselines.
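The interplay between retrieval and the privacy filter can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a fixed pure-$\varepsilon$ cost per query and basic (additive) composition instead of Rényi DP accounting, it assumes that a sample is excluded once its budget is exhausted, and it omits the differentially private aggregation of model outputs that the full pipeline requires. The names `FilteredKNNIndex`, `eps_max`, and `eps_query` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class FilteredKNNIndex:
    """Toy kNN index with a per-sample privacy filter (illustrative only)."""

    embeddings: list  # list of equal-length float vectors, one per context sample
    eps_max: float    # per-sample privacy budget (assumed pure-epsilon)
    spent: list = field(init=False)  # cumulative cost charged to each sample

    def __post_init__(self):
        self.spent = [0.0] * len(self.embeddings)

    def retrieve(self, query, k, eps_query):
        """Return indices of the k nearest samples that can still afford eps_query."""
        # Filter step: keep only samples whose budget would not be exceeded.
        active = [
            i for i, s in enumerate(self.spent)
            if s + eps_query <= self.eps_max + 1e-12
        ]
        # Retrieval step: rank the remaining samples by Euclidean distance.
        active.sort(key=lambda i: math.dist(query, self.embeddings[i]))
        chosen = active[:k]
        # Accounting step: charge the query cost to every selected sample
        # (basic composition, an assumption of this sketch).
        for i in chosen:
            self.spent[i] += eps_query
        return chosen


index = FilteredKNNIndex(
    embeddings=[[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], eps_max=1.0
)
for _ in range(3):
    print(index.retrieve([0.1, 0.0], k=2, eps_query=0.5))
```

In this toy run the two closest samples are returned for the first two queries, after which their budgets are spent and only the distant third sample remains eligible. Fewer than k samples may be returned once budgets run out, which is one way the budget affects utility.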
