Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?

Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane

TL;DR

The paper addresses removing or suppressing knowledge in large language models (LLMs) by reframing unlearning as a constrained form of knowledge editing in which the edit target is a refusal to disclose the information, i.e., an empty-set response. It evaluates state-of-the-art editing methods (ROME, MEMIT, GRACE, WISE, AlphaEdit) as unlearning baselines on both pretrained and finetuned knowledge, and introduces two adaptation techniques, self-improvement and query merging, to improve editing performance in unlearning settings. Empirical results show that WISE and AlphaEdit are strong baselines, particularly for pretrained knowledge; AlphaEdit often gives the strongest overall performance, including resilience to rephrase attacks, while ROME and MEMIT can perform very well when augmented with query merging. The authors recommend that the unlearning community adopt editing-based baselines and approach unlearning from an editing perspective, aiming at more holistic LLM memory control and safer model behavior with human-aligned refusal outputs.

Abstract

Large language model (LLM) unlearning, i.e., selectively removing information from LLMs, is vital for responsible model deployment. In contrast, LLM knowledge editing aims to modify LLM knowledge instead of removing it. Though editing and unlearning seem to be two distinct tasks, we find there is a tight connection between them. In this paper, we conceptualize unlearning as a special case of editing where information is modified to a refusal or "empty set" $\emptyset$ response, signifying its removal. This paper thus investigates whether knowledge editing techniques are strong baselines for LLM unlearning. We evaluate state-of-the-art (SOTA) editing methods (e.g., ROME, MEMIT, GRACE, WISE, and AlphaEdit) against existing unlearning approaches on pretrained and finetuned knowledge. Results show that certain editing methods, notably WISE and AlphaEdit, are effective unlearning baselines, especially for pretrained knowledge, and excel in generating human-aligned refusal answers. To better adapt editing methods for unlearning applications, we propose practical recipes including self-improvement and query merging. The former leverages the LLM's own in-context learning ability to craft a more human-aligned unlearning target, and the latter enables ROME and MEMIT to perform well in unlearning longer sample sequences. We advocate for the unlearning community to adopt SOTA editing methods as baselines and explore unlearning from an editing perspective for more holistic LLM memory control.
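
The reframing in the abstract, where unlearning is an edit whose target is a refusal, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `EditRequest` type, the `forget_set_to_edit_requests` helper, and the refusal string are hypothetical and are not taken from the paper or from any editing library.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical refusal target; the paper's actual refusal wording is not given here.
REFUSAL_TARGET = "I'm sorry, but I can't provide that information."


@dataclass(frozen=True)
class EditRequest:
    """One knowledge edit: make the model answer `prompt` with `target`."""
    prompt: str
    target: str


def forget_set_to_edit_requests(
    forget_questions: List[str], refusal: str = REFUSAL_TARGET
) -> List[EditRequest]:
    """Turn a forget set into edit requests that all share a refusal target.

    Ordinary editing would pair each prompt with a new fact; here every
    prompt is paired with the same "empty" (no information) answer.
    """
    return [EditRequest(prompt=q, target=refusal) for q in forget_questions]


requests = forget_set_to_edit_requests(
    ["Who wrote the novel X?", "Where was the author Y born?"]
)
for r in requests:
    print(f"{r.prompt!r} -> {r.target!r}")
```

A list like this could then be handed to whichever editing method is being evaluated; how each method consumes such requests is outside the scope of this sketch.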

Paper Structure

This paper contains 29 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Illustrations of the connection between editing and unlearning for LLMs. A: Editing aims to alter the knowledge to a target. B: Unlearning tries to remove the knowledge and generate an "empty" (without information) answer. C: Editing as unlearning can be done by an edit that alters the knowledge into a target refusal answer.
  • Figure 2: Methods of improving editing algorithms in unlearning settings. A: Self-improvement pipeline improves generalization and human value alignment for AlphaEdit and WISE. B: Query merging technique enables ROME and MEMIT to perform well under long unlearning sequences.
  • Figure 3: Results of different numbers of forget samples. Factual dataset, Llama2-7B.
  • Figure 4: Comprehensive analysis of unlearning performances. The same setting as Table \ref{tab:main_results}. Left bar charts: the score is 1 - Rouge1@Forget + Rouge1@Retain, the higher the better. Right radar figure: the higher the better; "Forget": 1 - Rouge1; "Rephrase": 1 - Rouge1; "Retain": Rouge1.
  • Figure 5: Case study of LLMs' answers after unlearning. Factual dataset, Llama2-7B.
  • ...and 1 more figure
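
The scores named in the Figure 4 caption can be computed as in the following minimal sketch. The two scoring formulas follow the caption directly. The `rouge1_recall` helper is an assumption: it is a simplified unigram recall with lowercase whitespace tokenization, and the paper may use a different ROUGE-1 variant or tokenizer. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1_recall(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1: fraction of reference unigrams found in the candidate.

    Assumption: lowercase whitespace tokenization, recall only.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each token's overlap by how often it appears in the candidate.
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / total


def bar_chart_score(rouge1_forget: float, rouge1_retain: float) -> float:
    """Left bar charts: 1 - Rouge1@Forget + Rouge1@Retain (higher is better)."""
    return 1.0 - rouge1_forget + rouge1_retain


def radar_axes(
    rouge1_forget: float, rouge1_rephrase: float, rouge1_retain: float
) -> Dict[str, float]:
    """Right radar figure: every axis is oriented so that higher is better."""
    return {
        "Forget": 1.0 - rouge1_forget,
        "Rephrase": 1.0 - rouge1_rephrase,
        "Retain": rouge1_retain,
    }


print(bar_chart_score(0.25, 0.75))
print(radar_axes(0.25, 0.5, 0.75))
```

Under these formulas the bar-chart score ranges from 0 (forget set fully reproduced, retain set fully lost) to 2 (forget set fully suppressed, retain set fully preserved).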