Position: Privacy Is Not Just Memorization!

Niloofar Mireshghallah, Tianshi Li

TL;DR

This paper reframes privacy in large language models as a sociotechnical problem that extends far beyond memorization of training data. It presents a taxonomy of three data types and five incident types to categorize privacy harms arising from data collection, deployment, and inference, along with a longitudinal study showing that memorization-focused research dominates while indirect inference and aggregation remain underexplored. It then offers a layered roadmap combining technical, sociotechnical, and policy interventions to address these threats, emphasizing user awareness, contextual norms, and governance. The work highlights the practical urgency of measuring real-world impact and fostering interdisciplinary collaboration to protect privacy without compromising the transformative potential of LLMs.

Abstract

The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remains underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle -- from data collection through deployment -- and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016--2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.

Paper Structure

This paper contains 54 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Examples of 'automatic' consent mechanisms deployed by Anthropic (giving a thumbs up or down on Claude responses opts the conversation into data collection, left) and OpenAI (selecting a response records the conversation in ChatGPT, right).
  • Figure 2: Example of a redacted query to ChatGPT’s deep research: it uncovers the name of an individual’s pet cat from a comment embedded in an HTML tag. This is particularly concerning, as such niche information is often used in password recovery, which could facilitate account theft and create security risks [little2024secure].
  • Figure 3: OpenAI and Claude both provide connectors for automatic integration of external user data, but the data is often scattered and requires manual deletion to be fully removed.
  • Figure 4: AI/ML Privacy Papers by Years, Broken Down by Venue Group
  • Figure 5: Incident type distribution
  • ...and 1 more figure