Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks

Vaidehi Patil, Peter Hase, Mohit Bansal

TL;DR

The paper investigates whether sensitive information can be truly removed from LLMs by editing model weights, framing deletion as an adversarial problem with defined threat models and budgets. It demonstrates that state-of-the-art edits like ROME fail to fully erase information, as both whitebox and blackbox attacks can recover deleted facts within a modest candidate set. The authors introduce defense strategies (notably Max-Entropy and Head Projection) and show they can substantially reduce whitebox extraction, though no defense universally blocks unforeseen or blackbox attacks. The results highlight a fundamental tension between deletion efficacy and model utility, underscoring the ongoing challenge of privacy and safety in deployed LLMs.

Abstract

Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of B generated candidates, based on scenarios where the information would be insecure if the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models.
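
The threat model in the abstract can be made concrete with a small scoring routine: an attack counts as successful when the sensitive answer appears among the first B candidates it produces. The sketch below is only an illustration under stated assumptions; the function names `attack_succeeds` and `attack_success_rate` are hypothetical, and the case-insensitive exact-match comparison is an assumed stand-in for whatever matching rule the paper actually uses.

```python
from typing import Sequence


def attack_succeeds(candidates: Sequence[str], answer: str, budget: int) -> bool:
    """Return True if `answer` is among the first `budget` candidates.

    Assumption: matching is exact after trimming whitespace and lowercasing.
    """
    normalized = [c.strip().lower() for c in candidates[:budget]]
    return answer.strip().lower() in normalized


def attack_success_rate(
    candidate_sets: Sequence[Sequence[str]],
    answers: Sequence[str],
    budget: int,
) -> float:
    """Fraction of questions whose answer is recovered within the budget."""
    if len(candidate_sets) != len(answers):
        raise ValueError("need exactly one candidate list per answer")
    if not answers:
        return 0.0
    hits = sum(
        attack_succeeds(cands, ans, budget)
        for cands, ans in zip(candidate_sets, answers)
    )
    return hits / len(answers)


if __name__ == "__main__":
    # Toy data, not results from the paper.
    candidate_sets = [["France", "Spain", "Italy"], ["Paris", "Rome"]]
    answers = ["Spain", "Berlin"]
    print(attack_success_rate(candidate_sets, answers, budget=2))
```

Under this definition, raising the budget can only keep the success rate the same or increase it, since every larger candidate set contains the smaller ones.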

Paper Structure

This paper contains 31 sections, 8 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: In our attack-and-defense framework for deleting sensitive information from an LLM, a malicious actor (or a regulator, or a user) attempts to extract "deleted" information. We introduce new methods for defending against extraction attacks.
  • Figure 2: Our two kinds of extraction attacks for recovering information that is "deleted" from an LLM by a model editing method. Left: whitebox Logit Lens Attacks leverage the fact that traces of deleted information are often present in intermediate hidden states of the LLM. Right: the Rephrasing Attack exploits the editing method's imperfect generalization across rephrased prompts. In both settings, the "deleted" answer ($y=\textrm{Spain}$) appears among the top $B$ candidates collected by the attack. We consider the attack successful for this budget $B$ (see the threat model in the problem statement section).
  • Figure 3: We defend against whitebox attacks by deleting information from intermediate hidden states as well as the final model output distribution (Max-Entropy and Head Projection Defenses).
  • Figure 4: Attack Success vs. the budget $B$ for our three attack methods. We "delete" facts from GPT-J with ROME using the conventional Empty Response objective.
  • Figure 5: As the attack budget increases, attack success increases and then saturates beyond a budget of 10. The budget for the HP and PD attacks is 20, and the budget for the BB (IR) attack is 10. The editing method is Fact Erasure and the model is GPT2-XL.
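
The whitebox Logit Lens Attack described in the Figure 2 caption can be sketched roughly as follows: project each intermediate hidden state onto the vocabulary, keep the top-scoring tokens at each layer, and pool them into one candidate set. This is a simplified illustration with toy numbers, not the authors' implementation. The names `project_to_vocab` and `logit_lens_candidates` are hypothetical, the per-layer `top_k` split of the budget is an assumption, and the sketch omits steps a real model would need, such as applying the final layer norm before the unembedding matrix.

```python
from typing import List, Sequence


def project_to_vocab(
    hidden: Sequence[float], unembed: Sequence[Sequence[float]]
) -> List[float]:
    """Multiply a hidden state (d_model) by an unembedding matrix (d_model x vocab)."""
    vocab_size = len(unembed[0])
    return [
        sum(h * row[v] for h, row in zip(hidden, unembed))
        for v in range(vocab_size)
    ]


def logit_lens_candidates(
    hidden_states: Sequence[Sequence[float]],
    unembed: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Pool the top_k token ids from every layer, without duplicates.

    Assumption: each layer contributes up to top_k candidates, so the
    pooled set has at most top_k * num_layers entries.
    """
    candidates: List[int] = []
    seen = set()
    for hidden in hidden_states:
        logits = project_to_vocab(hidden, unembed)
        # Token ids ordered from highest to lowest logit.
        ranked = sorted(range(len(logits)), key=lambda v: logits[v], reverse=True)
        for token_id in ranked[:top_k]:
            if token_id not in seen:
                seen.add(token_id)
                candidates.append(token_id)
    return candidates


if __name__ == "__main__":
    # Toy example: 2 layers, d_model = 2, vocabulary of 3 tokens.
    hidden_states = [[1.0, 0.0], [0.0, 1.0]]
    unembed = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.3]]
    print(logit_lens_candidates(hidden_states, unembed, top_k=1))
```

In this toy setup the two layers favor different tokens, which mirrors the caption's point: a token that no longer dominates the final output may still rank highly at an intermediate layer and so end up in the attacker's candidate set.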