Can Editing LLMs Inject Harm?

Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu

TL;DR

The paper reframes knowledge editing as a new safety threat, Editing Attack, and investigates its potential to inject harm into LLMs via misinformation and bias. It introduces the EditAttack dataset and a three-method editing framework (ROME, FT, ICE) to quantify efficacy, generalization, and portability of edits across multiple models. The study finds that editing attacks can inject both commonsense and long-tail misinformation, with commonsense edits typically more effective, and that a single biased sentence can substantially worsen overall fairness, all while exhibiting notable stealthiness. These findings highlight significant safety risks for open-source LLMs and underscore the need for robust defense strategies and governance to prevent misuse of knowledge editing techniques.

Abstract

Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. Then, we find that editing attacks can inject both types of misinformation into LLMs, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can cause a bias increase in general outputs of LLMs, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels.

Paper Structure (42 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The Illustration of Editing Attack for Misinformation Injection and Bias Injection. As for misinformation injection, editing attack can inject commonsense misinformation with high effectiveness. As for bias injection, one single editing attack can subvert the overall fairness.
  • Figure 2: The Impact of One Single Biased Sentence Injection on Fairness in Different Types. We adopt Bias Score (%) as the metric to evaluate the fairness of LLMs. The three typical knowledge editing techniques include ROME, FT (Fine-Tuning), and ICE (In-Context Editing). Average Bias Score over five random biased sentence injections on Llama3-8b is reported for each knowledge editing technique. The Bias Score results on Mistral-v0.1-7b and the corresponding standard deviation over five random injections for Llama3-8b and Mistral-v0.1-7b are in the appendix section "More Experiment Results".
  • Figure 3: The Impact of One Single Biased Sentence Injection on Fairness in Different Types. We adopt Bias Score (%) as the metric to evaluate the unfairness of LLMs. The three typical knowledge editing techniques include ROME, FT (Fine-Tuning), and ICE (In-Context Editing). Average Bias Score over five random biased sentence injections on Mistral-v0.1-7b is reported for each knowledge editing technique.
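
The captions for Figures 2 and 3 report an average Bias Score over five random biased sentence injections, with a standard deviation given in the appendix. A minimal sketch of that aggregation step follows. The function name `summarize_bias_scores`, the use of the sample standard deviation, and every number below are assumptions made for illustration; they are not the paper's code or results, and the sketch does not define how an individual Bias Score is computed.

```python
from statistics import mean, stdev


def summarize_bias_scores(scores):
    """Return (mean, sample standard deviation) of per-injection Bias Scores (%).

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the source does not say which estimator the authors chose.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical placeholder values (one Bias Score per random injection).
# These are NOT results from the paper.
hypothetical_runs = {
    "ROME": [40.0, 50.0, 60.0, 50.0, 50.0],
    "FT": [30.0, 35.0, 40.0, 35.0, 35.0],
    "ICE": [20.0, 20.0, 20.0, 20.0, 20.0],
}

for technique, scores in hypothetical_runs.items():
    avg, sd = summarize_bias_scores(scores)
    print(f"{technique}: mean={avg:.1f}%, std={sd:.2f}")
```

Reporting the spread alongside the mean helps a reader judge whether a bias increase is consistent across the five injected sentences or driven by one of them.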