Large Scale Knowledge Washing

Yu Wang; Ruihan Wu; Zexue He; Xiusi Chen; Julian McAuley

TL;DR

The paper tackles large-scale knowledge leakage in LLMs by proposing LAW, a method that erases targeted facts stored in the MLP layers of decoder-only transformers while aiming to preserve reasoning. Starting from a MEMIT-style initialization, LAW optimizes a $\beta$-constrained objective to forget a large set of triplet-based knowledge. It uses successive elimination to focus the updates and relies on a disentanglement view of knowledge and reasoning to limit the impact on reasoning. Experiments on small datasets and on a 332k-fact Wiki-Latest dataset show that LAW forgets the target knowledge thoroughly, with minimal degradation of reasoning and competitive preservation of unrelated knowledge, and that it outperforms standard unlearning and model-editing baselines. The work offers a practical approach to safer LLM deployment: sensitive or copyrighted information is removed without crippling general capabilities. The authors state that the code will be released.

Abstract

Large language models show impressive abilities in memorizing world knowledge, which leads to concerns regarding memorization of private information, toxic or sensitive knowledge, and copyrighted content. We introduce the problem of Large Scale Knowledge Washing, focusing on unlearning an extensive amount of factual knowledge. Previous unlearning methods usually define the reverse loss and update the model via backpropagation, which may affect the model's fluency and reasoning ability or even destroy the model due to extensive training with the reverse loss. Existing works introduce additional data from downstream tasks to prevent the model from losing capabilities, which requires downstream task awareness. Controlling the tradeoff of unlearning and maintaining existing capabilities is also challenging. To this end, we propose LAW (Large Scale Washing) to update the MLP layers in decoder-only large language models to perform knowledge washing, as inspired by model editing methods and based on the hypothesis that knowledge and reasoning are disentanglable. We derive a new objective with the knowledge to be unlearned to update the weights of certain MLP layers. Experimental results demonstrate the effectiveness of LAW in forgetting target knowledge while maintaining reasoning ability. The code will be open-sourced at https://github.com/wangyu-ustc/LargeScaleWashing.

Paper Structure (38 sections, 21 equations, 2 figures, 17 tables)

This paper contains 38 sections, 21 equations, 2 figures, 17 tables.

Introduction
Related Work
Preliminary
The Structure of Decoder-only Large Language Models
Previous Model Editing Strategy
Problem Setup
Methodology
Practical Consideration
Initialization of $\hat{\Delta}$.
Choices of $\beta$.
Successive Elimination of Target Knowledge Sets.
Discussions
Disentanglement of Knowledge and Reasoning
Handling Output Behavior and Hallucination
Experiments
...and 23 more sections

Figures (2)

Figure 1: The process of Large Scale Knowledge Washing. We aim to remove private, toxic, or copyrighted knowledge, such as a Social Security number (SSN), from the LLM while maintaining the model's reasoning ability to answer questions such as "$a>b, b>c, a?c$", whose answer should be "$>$".
Figure 2: Details of the update process in Eq. (\ref{eq:final_objective}). Here $K_w$ denotes the keys of the knowledge to be washed and $V_w$ denotes the corresponding values. Before the modification, $V_w$ is the output of layer $W_{out}^l$ given the input $K_w$. We then add $\Delta$ to $W_{out}^l$, where $\Delta$ is optimized via Eq. (\ref{eq:final_objective}). For simplicity, $W_{out}^l$ is denoted as $W_0$ in Section \ref{sec:methodology}, and $K$ denotes the original keys in $W_0$ before the modification (see Eq. (\ref{eq:w_0})). The intuition is to unlearn the knowledge in $K_w$ without disturbing the model's other abilities encoded in $W_0$.
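
The caption of Figure 2 describes adding an update $\Delta$ to $W_0$ so that the outputs for the washed keys $K_w$ change while the outputs for the original keys $K$ stay close to what they were. The paper's own objective (Eq. (\ref{eq:final_objective})) is not reproduced in this overview, so the sketch below does not implement LAW. As a stand-in, it uses the closed-form least-squares update familiar from MEMIT-style model editing. The function names, the weight `lam`, the toy shapes, and the all-zero target values `V_target` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def memit_style_delta(W0, K, K_w, V_target, lam=1.0):
    """Closed-form least-squares weight update (MEMIT-style stand-in, not LAW's objective).

    Minimizes over D:  ||(W0 + D) K_w - V_target||_F^2 + lam * ||D K||_F^2
    Shapes: W0 (d_out, d_in), K (d_in, n), K_w (d_in, m), V_target (d_out, m).
    """
    R = V_target - W0 @ K_w              # residual the update must absorb on K_w
    A = lam * (K @ K.T) + K_w @ K_w.T    # symmetric (d_in, d_in) system matrix
    # Stationarity gives D A = R K_w^T; solve the transposed system A D^T = K_w R^T.
    return np.linalg.solve(A, K_w @ R.T).T


def objective(D, W0, K, K_w, V_target, lam=1.0):
    """Value of the least-squares objective minimized above."""
    fit = np.linalg.norm((W0 + D) @ K_w - V_target) ** 2
    drift = lam * np.linalg.norm(D @ K) ** 2
    return fit + drift


# Toy data: all sizes and the zero target below are illustrative assumptions.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 16))
K = rng.normal(size=(16, 200))     # keys whose outputs should stay put
K_w = rng.normal(size=(16, 5))     # keys of the knowledge to be washed
V_target = np.zeros((8, 5))        # hypothetical replacement values

delta = memit_style_delta(W0, K, K_w, V_target)
print("objective before:", objective(np.zeros_like(W0), W0, K, K_w, V_target))
print("objective after: ", objective(delta, W0, K, K_w, V_target))
```

In this stand-in, a larger `lam` penalizes drift on the preserved keys $K$ more heavily, at the cost of a looser fit on $K_w$. How LAW itself balances the two is governed by its own objective and its choice of $\beta$, which this sketch does not model.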
