Preserving Privacy in Large Language Models: A Survey on Current Threats and Solutions

Michele Miranda; Elena Sofia Ruzzetti; Andrea Santilli; Fabio Massimo Zanzotto; Sébastien Bratières; Emanuele Rodolà

TL;DR

The survey analyzes privacy threats to large language models, focusing on data memorization, inference-time leakage, and adversarial prompts. It groups defenses into data-centric methods (anonymization) and model-centric methods (differential privacy, including DP-SGD and DP-FL, and machine unlearning), while noting practical trade-offs and runtime costs. The authors synthesize current literature across data anonymization, DP for training/inference, federated approaches, and cryptographic techniques, highlighting the need for scalable, practical privacy guarantees in real-world LLM deployments. They also review tools and frameworks enabling privacy-preserving development and outline future directions, such as selective data privacy, improved unlearning methods, and hybrid approaches that balance privacy with utility in large-scale models. Overall, the work aims to guide researchers and practitioners toward secure, trustworthy LLM systems by documenting threats, defenses, and actionable paths forward.

Abstract

Large Language Models (LLMs) represent a significant advancement in artificial intelligence, finding applications across various domains. However, their reliance on massive internet-sourced datasets for training brings notable privacy issues, which are exacerbated in critical domains (e.g., healthcare). Moreover, certain application-specific scenarios may require fine-tuning these models on private data. This survey critically examines the privacy threats associated with LLMs, emphasizing the potential for these models to memorize and inadvertently reveal sensitive information. We explore current threats by reviewing privacy attacks on LLMs and propose comprehensive solutions for integrating privacy mechanisms throughout the entire learning pipeline. These solutions range from anonymizing training datasets to implementing differential privacy during training or inference and machine unlearning after training. Our comprehensive review of existing literature highlights ongoing challenges, available tools, and future directions for preserving privacy in LLMs. This work aims to guide the development of more secure and trustworthy AI systems by providing a thorough understanding of privacy preservation methods and their effectiveness in mitigating risks.

Paper Structure (54 sections, 7 equations, 9 figures, 7 tables)

Introduction
Preliminaries
Large Language Models
Differential Privacy
Deep Learning with Differential Privacy
Federated Learning
Secure Multi-Party Computation
Privacy Attacks
Training Data Extraction
Non-adversarial extraction
Adversarial prompting
Membership Inference Attacks (MIA)
MIA with Thresholds
Model Inversion and Stealing
Model Output Inversion and Model Stealing
...and 39 more sections

Figures (9)

Figure 1: Taxonomy of Preserving Privacy in Large Language Models. Red indicates various attack techniques, and blue represents current possible solutions to preserve privacy by acting on the training data or model. Finally, in orange we highlight the currently available tools to preserve privacy.
Figure 2: Differential Privacy. On the left: classical DP applied to a neural network, picture from baraheem2022survey. On the right: DP applied to a Language Model (analyzed in detail in Section \ref{sec:dp_llm}). Differential privacy can be applied by adding a perturbation (noise) at different points of the architecture: 1) Input: noise is added to the input; 2) Gradient: noise is added to the gradient update step during training; 3) Output: noise is added to the output; 4) Labels: noise is added to the labels used during training; 5) Loss: noise is added to the loss function during training.
Figure 3: Attacks against LLMs aim to reconstruct private training data. Training Data Extraction Attack (a): The attacker has no internal access to the model (black box) and can only design a malicious prompt to force the model generation to reveal private and potentially sensitive training data. Membership Inference Attacks (b): The attacker has access to data that might have been included in the training set. The goal is to determine whether specific data points were indeed part of the training set. This is achieved by detecting behavioural differences in the model's responses to data that were included in the training set versus data that were not. Model Inversion Attack (c): In a black-box or grey-box setting, the attacker observes the model's outputs or gradients on unknown inputs. The attacker then uses this information to reconstruct the original input from these seemingly opaque sources.
Figure 4: Classical Anonymization Pipeline. The figure illustrates the standard process for anonymizing data in natural language processing tasks. The pipeline consists of three sequential blocks: Preprocessing, Identification, and Anonymization. The Preprocessing block prepares the raw data for further analysis. The Identification block involves the use of a procedure to detect potential Personally Identifiable Information (PII) within the data (e.g., Named Entity Recognition). Finally, the anonymization block replaces identified PII with anonymous substitutes, ensuring data privacy preservation. Arrows indicate the flow of data through each stage of the process.
Figure 5: Anonymization with Differential Privacy. This figure illustrates the procedure for anonymizing data via differential privacy. A neural model embeds the data, and the resulting embeddings are then perturbed with DP noise. The anonymized output is obtained by decoding the perturbed embeddings.
...and 4 more figures
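
The three-block pipeline described in the Figure 4 caption (Preprocessing, Identification, Anonymization) can be illustrated with a minimal sketch. This is not the survey's implementation: the function names (`preprocess`, `identify`, `anonymize`, `anonymize_pipeline`) and the `PII_PATTERNS` table are hypothetical, and simple regular expressions stand in for the Named Entity Recognition step mentioned in the caption, so names and other free-text PII are not detected here.

```python
import re

# Hypothetical, illustrative patterns only. A real Identification block
# would typically rely on an NER model rather than regular expressions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def preprocess(text):
    """Preprocessing block: collapse runs of whitespace."""
    return " ".join(text.split())


def identify(text):
    """Identification block: return sorted, non-overlapping (start, end, label) spans."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    kept, last_end = [], -1
    for start, end, label in spans:
        if start >= last_end:  # skip spans overlapping an earlier one
            kept.append((start, end, label))
            last_end = end
    return kept


def anonymize(text, spans):
    """Anonymization block: replace each detected span with a placeholder tag."""
    pieces, cursor = [], 0
    for start, end, label in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def anonymize_pipeline(raw):
    """Run the three blocks in sequence, mirroring the arrows in the figure."""
    text = preprocess(raw)
    return anonymize(text, identify(text))


if __name__ == "__main__":
    print(anonymize_pipeline("Reach me at  alice@example.com or +1 555 123 4567."))
```

Keeping identification separate from replacement means the detector can be swapped (for example, for an NER model) without touching the substitution logic.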
