ContextLeak: Auditing Leakage in Private In-Context Learning Methods

Jacob Choi, Shuying Cao, Xingjian Dong, Wang Bill Zhu, Robin Jia, Sai Praneeth Karimireddy

TL;DR

ContextLeak presents a black-box auditing framework for private in-context learning by inserting uniquely identifiable canaries and crafting targeted queries to empirically bound information leakage. The auditor-derived accuracy is transformed into an empirical lower bound on privacy loss $\epsilon$, and experiments show leakage scales with the theoretical budget while exposing weaknesses in both heuristic and formal defenses. The work reveals that common defenses such as prompt-based methods and LLM-based detectors can be insufficient against strong audits, and DP-based approaches like RNM and ESA entail detectable privacy-utility trade-offs. Overall, ContextLeak provides a practical, adversarial benchmarking tool and motivates the development of more robust privacy-preserving strategies for ICL.
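
One standard way to turn auditor accuracy into a lower bound on $\epsilon$ is sketched below. It assumes pure $\epsilon$-DP and a canary inserted with probability 0.5. The paper's exact transformation is not reproduced in this summary and may differ, for example by accounting for finite-sample confidence intervals. The symbol $\mathrm{acc}$ is introduced here for illustration.

$$
\mathrm{acc} \;\le\; \frac{e^{\epsilon}}{1 + e^{\epsilon}}
\quad\Longrightarrow\quad
\epsilon \;\ge\; \ln\!\left(\frac{\mathrm{acc}}{1 - \mathrm{acc}}\right)
$$

Under these assumptions, $\mathrm{acc} = 0.5$ gives the trivial bound $\epsilon \ge 0$, $\mathrm{acc} = 0.75$ gives $\epsilon \ge \ln 3 \approx 1.10$, and $\mathrm{acc} = 0.9$ gives $\epsilon \ge \ln 9 \approx 2.20$. As $\mathrm{acc}$ approaches 1, the bound grows without limit.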

Abstract

In-Context Learning (ICL) has become a standard technique for adapting Large Language Models (LLMs) to specialized tasks by supplying task-specific exemplars within the prompt. However, when these exemplars contain sensitive information, reliable privacy-preserving mechanisms are essential to prevent unintended leakage through model outputs. Many privacy-preserving methods have been proposed to prevent leakage of information in the context, but little effort has gone into auditing those methods. We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. ContextLeak uses canary insertion, embedding uniquely identifiable tokens in exemplars and crafting targeted queries to detect their presence. We apply ContextLeak across a range of private ICL techniques, both heuristic ones, such as prompt-based defenses, and those with theoretical guarantees, such as Embedding Space Aggregation and Report Noisy Max. We find that the leakage measured by ContextLeak correlates tightly with the theoretical privacy budget ($\epsilon$) and that ContextLeak reliably detects leakage. Our results further reveal that existing methods often strike poor privacy-utility trade-offs, either leaking sensitive information or severely degrading performance.

Paper Structure

This paper contains 36 sections, 1 equation, 15 figures, 1 table, 3 algorithms.

Figures (15)

  • Figure 1: Threat model. Sensitive data (such as patient medical records or customer conversations) in ICL can be exposed to end users if they input an adversarial user prompt. A malicious user can input an arbitrary user prompt in an attempt to extract the sensitive dataset. We want to prevent the user from learning even the membership of a worst-case data point, i.e., to bound the probability of a successful membership inference attack on any potential data point by a malicious user with access to the user prompt and the output. General Auditing Strategy. We insert worst-case canaries into the ICL exemplars and measure privacy leakage from the canaries in the output.
  • Figure 2: General auditing methodology. We design a canary (a uniquely identifiable data point) and a specific user query. The canary is added to the exemplars with probability 0.5 and, along with our custom user query, is input into the private ICL method. We then examine the output and must determine whether the canary was present. The user prompt is specifically crafted so that the output reveals whether the canary was present. This setup is repeated $n$ times and the auditor accuracy is computed. An accuracy of 50% corresponds to random guessing, whereas 100% accuracy corresponds to full privacy leakage.
  • Figure 3: We compare our attack with a prompt-injection attack, which asks the model to ignore all defense-based instructions and reveal the sensitive information in context. L0-L3 denote increasing strength of defenses. L0 contains no defenses, and we observe full privacy leakage across attacks and models. L1 and L2 denote increasing levels of prompt-based defenses, which ask the model to refuse to leak the dataset using tested prompt-based strategies. L3 denotes the strongest defense, an LLM-based detector that determines whether the output leaks information. While the prompt-injection attack is stopped at L3 across models, our attack exhibits full privacy leakage across all levels of defenses and across models.
  • Figure 4: Comparison of auditing performance across the user-query strategies and canary types outlined in the General Auditing Strategy section. We observe that the attacks generalize well across the different user queries and canaries, but performance varies more with the Llama model. The input-output and if-then-explicit user-query types perform best across canary types, and the strongest attack uses the input-output user-query strategy with the hex canary. Our experiments are run on the SubJ dataset over 100 queries.
  • Figure 5: The user-query method is optimized for the strongest attack from Figure 4, namely the input-output attack with the hex canary.
  • ...and 10 more figures
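
The auditing game described in the Figure 2 caption can be sketched as a small simulation. This is a minimal illustration, not the paper's implementation: `toy_icl_method` and its `leak_prob` parameter are hypothetical stand-ins for a private ICL method, and the auditor here simply reads the canary bit off the output.

```python
import random


def toy_icl_method(canary_present: bool, leak_prob: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for a private ICL method (an assumption, not from the paper).

    With probability leak_prob the output reveals whether the canary was in
    the exemplars; otherwise the output is an uninformative coin flip.
    """
    if rng.random() < leak_prob:
        return canary_present
    return rng.random() < 0.5


def run_audit(leak_prob: float, n_trials: int, seed: int = 0) -> float:
    """Repeat the canary game n_trials times and return the auditor's accuracy."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        # Insert the canary into the exemplars with probability 0.5.
        canary_present = rng.random() < 0.5
        # Query the (toy) private ICL method with the crafted user query.
        output_signal = toy_icl_method(canary_present, leak_prob, rng)
        # The auditor guesses "present" exactly when the output signals it.
        guess = output_signal
        correct += int(guess == canary_present)
    return correct / n_trials


if __name__ == "__main__":
    for leak_prob in (0.0, 0.5, 1.0):
        accuracy = run_audit(leak_prob, n_trials=10_000)
        print(f"leak_prob={leak_prob:.1f} auditor accuracy={accuracy:.3f}")
```

In this toy setting, a method that never leaks keeps the auditor near chance, while a method that always leaks lets the auditor guess correctly every time, matching the 50% and 100% endpoints in the caption.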