Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Niloofar Mireshghallah; Hyunwoo Kim; Xuhui Zhou; Yulia Tsvetkov; Maarten Sap; Reza Shokri; Yejin Choi

TL;DR

The paper frames privacy as contextual integrity in interactive LLM use and introduces ConfAIde, a four-tier benchmark to systematically probe inference-time privacy reasoning. It demonstrates that even advanced models like GPT-4 and ChatGPT exhibit substantial private-information leakage, especially as tasks increase in complexity and require theory-of-mind capabilities. The findings show that simple safeguards (privacy prompts, chain-of-thought) are insufficient, underscoring a need for fundamental inference-time privacy mechanisms. Overall, ConfAIde reveals critical gaps between human privacy expectations and model behavior, with practical implications for deploying LLMs in real-world, privacy-sensitive settings.

Abstract

The interactive use of large language models (LLMs) in AI assistants (at work, home, etc.) introduces a new set of inference-time privacy risks: LLMs are fed different types of information from multiple sources in their inputs and are expected to reason about what to share in their outputs, for what purpose and with whom, within a given context. In this work, we draw attention to the highly critical yet overlooked notion of contextual privacy by proposing ConfAIde, a benchmark designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned LLMs. Our experiments show that even the most capable models such as GPT-4 and ChatGPT reveal private information in contexts that humans would not, 39% and 57% of the time, respectively. This leakage persists even when we employ privacy-inducing prompts or chain-of-thought reasoning. Our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
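To make the leakage percentages above concrete, the following is a minimal sketch of how a string-matching leakage rate could be computed over a set of model responses. It is not the authors' evaluation code: the function names `leaks_secret` and `leakage_rate`, the case-insensitive substring criterion, and the example responses are all assumptions made for illustration.

```python
from typing import Iterable


def leaks_secret(response: str, secret: str) -> bool:
    """Return True if the secret appears verbatim (case-insensitive) in the response.

    Assumption: a plain substring match stands in for leakage detection.
    """
    return secret.casefold() in response.casefold()


def leakage_rate(responses: Iterable[str], secret: str) -> float:
    """Fraction of responses that contain the secret string."""
    responses = list(responses)
    if not responses:
        return 0.0
    leaked = sum(leaks_secret(r, secret) for r in responses)
    return leaked / len(responses)


# Hypothetical responses, not taken from the benchmark.
example_responses = [
    "I can't share what Jane told me in confidence.",
    "Jane mentioned she is dealing with a rare blood disorder.",
]
print(leakage_rate(example_responses, "rare blood disorder"))
```

A substring check of this kind only catches verbatim disclosure; a paraphrased leak would go undetected, so a rate computed this way is best read as a lower bound.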

Paper Structure (29 sections, 11 figures, 14 tables)

This paper contains 29 sections, 11 figures, 14 tables.

Introduction
Background & Related Works
ConfAIde: Benchmarking Contextual Privacy Reasoning in LLMs
Tier 1: Information Sensitivity Out of Context
Tier 2: Information Flow Sensitivity in Context
Tier 3: Theory of Mind as Context
Tier 4: Private & Public Information Flow
Human Annotations
Experimental Results
All Tiers: Alignment with Human Judgement
Tiers 1 & 2 Results
Tier 3 Results
Tier 4 Results
Is Chain of Thought Reasoning a Viable Mitigation?
Conclusion and Discussion
...and 14 more sections

Figures (11)

Figure 1: Overview of our multi-tiered ConfAIde benchmark. As tiers progress, the contextual complexity of the tasks and the reasoning capabilities needed to respond increase, with the first tier being a simple question about the sensitivity of an information type, and the last tier involving keeping track of the flow of multiple information types between multiple people. Full examples can be found in Table \ref{tab:example}.
Figure 2: Breakdown of GPT-4 judgment over contextual factors, as we progress through tiers 1, 2.a and 2.b.
Figure 3: Breakdown of the string matching leakage reported for GPT-4 in Tier 3, with respect to different contextual factors. Lower means lower leakage.
Figure 4: Tiers 1 and 2.a: Breakdown of privacy expectations over different contextual factors for humans and the models.
Figure 5: Tiers 1 and 2.b: Breakdown of privacy expectations over different contextual factors for humans and the models.
...and 6 more figures
