Position: Privacy Is Not Just Memorization!
Niloofar Mireshghallah, Tianshi Li
TL;DR
This paper reframes privacy in large language models as a sociotechnical problem that extends far beyond memorization of training data. It presents a taxonomy of three data types and five incident types to categorize privacy harms arising from data collection, deployment, and inference, along with a longitudinal study showing that memorization-focused research dominates while indirect inference and aggregation remain underexplored. It then offers a layered roadmap combining technical, sociotechnical, and policy interventions to address these multifaceted threats, emphasizing user awareness, contextual norms, and governance. The work highlights the practical urgency of measuring real-world impact and fostering interdisciplinary collaboration to protect privacy without compromising the transformative potential of LLMs.
Abstract
The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remains underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle -- from data collection through deployment -- and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016--2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.
