Large Language Model Unlearning via Embedding-Corrupted Prompts

Chris Yuhao Liu; Yaxuan Wang; Jeffrey Flanigan; Yang Liu

TL;DR

This paper introduces Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models that imposes an unlearned state at inference time, without retraining the model or modifying its weights. ECO combines a prompt classifier, which detects prompts that fall within the forget target, with embedding-space corruptions learned offline via zeroth-order optimization. The corruptions steer outputs toward those of a retained model (one never trained on the forget data) while preserving performance on benign data. The authors report near-perfect forgetting with minimal side effects across three knowledge domains (entity leakage, hazardous knowledge, and copyrighted content) and show that the approach scales to 100 models ranging from 0.5B to 236B parameters. They also explore simple threshold calibration and conformal prediction to decide robustly when to apply corruption, and report experiments on the TOFU, WMDP, MMLU, HP Book, and BBC News benchmarks. The work presents ECO as practical, scalable unlearning for real-world deployments, and notes limitations such as a dependence on API access and potential risks if the prompt classifier is compromised.
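The thresholding step mentioned above can be illustrated with generic split conformal calibration. This is a minimal sketch under assumptions, not the authors' procedure: it assumes the classifier emits a "forget" score per prompt, that a held-out set of scores on benign prompts is available, and that new benign prompts are exchangeable with that set. The names `conformal_threshold`, `should_corrupt`, and `alpha` are hypothetical.

```python
import math


def conformal_threshold(benign_scores, alpha):
    """Pick a score threshold from calibration scores on benign prompts.

    Under exchangeability, a new benign prompt scores above the returned
    threshold with probability at most alpha (standard split conformal bound).
    """
    n = len(benign_scores)
    # Rank of the conformal quantile among the n calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never flag anything.
        return float("inf")
    return sorted(benign_scores)[k - 1]


def should_corrupt(forget_score, threshold):
    """Apply corruption only when the classifier score exceeds the threshold."""
    return forget_score > threshold


# Hypothetical calibration scores: classifier "forget" scores on benign prompts.
calibration = [0.05, 0.10, 0.02, 0.30, 0.08, 0.15, 0.20]
threshold = conformal_threshold(calibration, alpha=0.25)
```

A smaller `alpha` pushes the threshold up, so fewer benign prompts are corrupted by mistake, at the price of letting more borderline forget prompts through.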

Abstract

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models to address both the challenges of knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth order optimization toward the unlearning objective offline and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output from a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning at nearly zero side effects in general domains and domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases. We have made our code publicly available at https://github.com/chrisliu298/llm-unlearn-eco.
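Zeroth-order optimization, as named in the abstract, uses only evaluations of an objective rather than gradients through the model. The following is a minimal sketch of that idea, assuming a single scalar corruption strength `sigma` and a two-point finite-difference gradient estimate. This is one common zeroth-order scheme and not necessarily the estimator the authors use. `toy_objective` is a hypothetical stand-in for the real unlearning objective, which would require querying the LLM with corrupted prompts.

```python
def zeroth_order_minimize(objective, sigma0, lr=0.1, mu=1e-2, steps=200):
    """Minimize a black-box scalar objective using function evaluations only."""
    sigma = sigma0
    for _ in range(steps):
        # Two-point (central) finite-difference estimate of d objective / d sigma.
        grad_est = (objective(sigma + mu) - objective(sigma - mu)) / (2 * mu)
        # Plain gradient-descent step on the estimated gradient.
        sigma -= lr * grad_est
    return sigma


def toy_objective(sigma):
    # Hypothetical stand-in: pretend the unlearning objective is minimized
    # at a corruption strength of 3.0. A real objective would call the model.
    return (sigma - 3.0) ** 2


sigma_star = zeroth_order_minimize(toy_objective, sigma0=0.0)
```

Because each step needs only two objective evaluations and no backward pass, the cost of this kind of search depends on how many forward queries are made, not on access to the model's gradients.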

Paper Structure (86 sections, 24 equations, 4 figures, 68 tables)

Introduction
Preliminaries and Problem Setup
Threat Model
Problem Setup
ECO: Unlearned LLMs via Embedding-Corrupted Prompts
Method Overview
Decision Threshold Calibration and Conformal Prediction
Embedding-Corrupted Prompts
Experiments
Prompt Classifier
Entity Unlearning
Hazardous Knowledge Unlearning
Copyrighted Content Unlearning
Related Work
Conclusion
...and 71 more sections

Figures (4)

Figure 1: Using embedding-corrupted prompts to maintain an unlearned state on the LLM subject to unlearning. We first employ a classifier to identify whether the incoming prompt falls within the scope of the unlearning target. We construct embedding-corrupted prompts by selectively corrupting dimensions within the tokens' embeddings. The corruption parameter is learned offline via zeroth order optimization. An unlearned state is imposed during inference and does not require any updates to the original model's weights.
Figure 2: Model utility versus forget quality (p-value) on three different forget set sizes of the TOFU dataset after unlearning. We show two models, Phi-1.5 (top) and Llama-2-7B-Chat (bottom). For GA, GD, KL, PO, and the prompting baseline, the forget qualities are either too small or come at the cost of a substantial decrease in model utility. Negative preference optimization (NPO; Zhang et al., 2024) variants achieve a good balance in some cases, but the trade-off in model utility is still non-trivial. ECO-RN (random noise) and ECO-ZO (zero-out) achieve an almost identical distribution to the retained model while incurring no sacrifice in model utility.
Figure 3: Probing results based on model output logits before and after unlearning on the WMDP dataset via ECO. The linear probes' accuracy remains at random chance for all three models, regardless of their size and performance. This indicates that ECO is resistant to linear probes trained on the raw output logits: the corrupted prompts effectively guard against inferring the correct answer from the logits.
Figure 4: The number of parameters of the model subject to unlearning versus the average performance on the WMDP benchmark and MMLU subsets. This figure visualizes the forget set accuracy reported in the tables labeled tab:wmdp_all_models and tab:mmlu_all_models.
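
The inference-time pipeline described in the Figure 1 caption (classify the prompt, then corrupt selected embedding dimensions only if it is flagged) can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: embeddings are plain nested lists, the random-noise and zero-out variants are simple readings of the ECO-RN and ECO-ZO names in the Figure 2 caption, and all function names, `dims`, and the threshold value are hypothetical.

```python
import random


def corrupt_random_noise(embeddings, sigma, dims, rng):
    """Add Gaussian noise scaled by sigma to the selected dimensions of each token."""
    return [
        [x + sigma * rng.gauss(0.0, 1.0) if j in dims else x
         for j, x in enumerate(token)]
        for token in embeddings
    ]


def corrupt_zero_out(embeddings, dims):
    """Set the selected dimensions of each token embedding to zero."""
    return [
        [0.0 if j in dims else x for j, x in enumerate(token)]
        for token in embeddings
    ]


def eco_embeddings(embeddings, forget_score, threshold, corrupt):
    """Gate: corrupt only prompts the classifier flags; pass others unchanged."""
    if forget_score > threshold:
        return corrupt(embeddings)
    return embeddings


# Two token embeddings of width 3 (toy values), corrupting dimensions 0 and 2.
emb = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
dims = {0, 2}
flagged = eco_embeddings(emb, 0.97, 0.5, lambda e: corrupt_zero_out(e, dims))
benign = eco_embeddings(emb, 0.03, 0.5, lambda e: corrupt_zero_out(e, dims))
noisy = corrupt_random_noise(emb, 0.1, dims, random.Random(0))
```

The model's weights never appear in this sketch: only the embeddings passed to the model change, which is consistent with the caption's statement that no weight updates are required.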
