Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference

Jiabao Ji, Yujian Liu, Yang Zhang, Gaowen Liu, Ramana Rao Kompella, Sijia Liu, Shiyu Chang

TL;DR

The paper addresses LLM unlearning, motivated by privacy and copyright concerns, by reframing the task as a logit-difference operation between the target LLM and an assistant LLM. It trains a lightweight assistant with reversed objectives (remember the forget data, forget the retain knowledge) and derives the unlearned model by subtracting the assistant's logits from the target's, which avoids the degenerated output and catastrophic forgetting that affect conventional forget/retain objectives. Experiments on ToFU and HarryPotter show that the method achieves the intended forgetting with no loss of model utility on ToFU and a more than threefold reduction in training time, helped by LoRA-based parameter efficiency and data augmentation. The approach has practical implications for safer LLM deployment and may extend to related tasks such as knowledge editing and factuality improvement.

Abstract

As Large Language Models (LLMs) demonstrate extensive capability in learning from documents, LLM unlearning becomes an increasingly important research area to address concerns of LLMs in terms of privacy, copyright, etc. A conventional LLM unlearning task typically involves two goals: (1) the target LLM should forget the knowledge in the specified forget documents, and (2) it should retain the other knowledge that the LLM possesses, for which we assume access to a small number of retain documents. To achieve both goals, a mainstream class of LLM unlearning methods introduces an optimization framework that combines two objectives (maximizing the prediction loss on the forget documents while minimizing that on the retain documents), which suffers from two challenges: degenerated output and catastrophic forgetting. In this paper, we propose a novel unlearning framework called Unlearning from Logit Difference (ULD), which introduces an assistant LLM that aims to achieve the opposite of the unlearning goals: remembering the forget documents and forgetting the retain knowledge. ULD then derives the unlearned LLM by computing the logit difference between the target and the assistant LLMs. We show that such reversed objectives naturally resolve both aforementioned challenges while significantly improving training efficiency. Extensive experiments demonstrate that our method efficiently achieves the intended forgetting while preserving the LLM's overall capabilities, reducing training time by more than threefold. Notably, our method loses 0% of model utility on the ToFU benchmark, whereas baseline methods may sacrifice 17% of utility on average to achieve comparable forget quality. Our code will be publicly available at https://github.com/UCSB-NLP-Chang/ULD.
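
The logit-difference step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the four-token vocabulary, the logit values, the function names, and the scaling weight `alpha` are all assumptions made for illustration. It shows only the general idea: where the assistant is confident about a forget answer, subtracting its logits suppresses that answer, and where the assistant's logits are flat, the softmax of the target is left unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_difference(target_logits, assistant_logits, alpha=1.0):
    """Subtract the (scaled) assistant logits from the target logits.

    `alpha` is a hypothetical weight introduced for this sketch only.
    """
    if len(target_logits) != len(assistant_logits):
        raise ValueError("logit vectors must share one vocabulary")
    return [t - alpha * a for t, a in zip(target_logits, assistant_logits)]


# Toy next-token logits over an assumed 4-token vocabulary.
vocab = ["Paris", "London", "Rome", "the"]

# Forget-style query: the target knows the answer, and so does the assistant.
target_forget = [5.0, 1.0, 1.0, 2.0]
assistant_forget = [4.0, 0.0, 0.0, 0.0]
combined_forget = logit_difference(target_forget, assistant_forget)
p_before = softmax(target_forget)
p_after = softmax(combined_forget)

# Retain-style query: the assistant has no preference (flat logits), so the
# subtraction shifts every logit equally and the softmax is unchanged.
target_retain = [0.5, 3.0, 0.2, 1.0]
assistant_retain = [0.3, 0.3, 0.3, 0.3]
p_retain_before = softmax(target_retain)
p_retain_after = softmax(logit_difference(target_retain, assistant_retain))

if __name__ == "__main__":
    print("P(forget answer) before:", round(p_before[0], 3))
    print("P(forget answer) after: ", round(p_after[0], 3))
    print("retain distribution unchanged:",
          all(abs(a - b) < 1e-9 for a, b in zip(p_retain_before, p_retain_after)))
```

In this toy setting only the assistant would need training, which is consistent with the abstract's claim that reversing the objectives improves training efficiency; the target LLM's own weights are not touched by the subtraction.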

Paper Structure (46 sections, 12 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Illustration of the logit subtraction operation. We simulate the output distribution of an unlearned LLM using the assistant LLM's output.
  • Figure 2: Illustration of constructing the assistant LLM from the target LLM itself. Note that we fix the assistant LLM's parameters and only optimize the added LoRA layers.
  • Figure 3: Performance on the HarryPotter dataset. R-L and Avg. Acc. denote the ROUGE-L score and the average zero-shot accuracy over six LLM benchmarks, respectively. The models before and after fine-tuning (target LLM) are included for reference. Best results are in bold for retain performance. For forget performance, no values are in bold as there is no ground truth.
  • Figure 3: Cross-entropy (CE) loss of the unlearned LLM over the course of training, on the forget data $\mathcal{D}_f$ (left) and on retain data not covered by $\mathcal{D}_r$ (right). The loss of ULD is evaluated on the unlearned LLM derived using logit subtraction. We select baselines with KL retain loss in this figure. Appendix Figure \ref{fig:full-stability-loss} shows the full results.
  • Figure 4: Performance of different unlearning methods on ToFU-10% with different forget/retain data configurations. We include baselines with competitive forget performance here and list the full results in Appendix \ref{sec:additional-usage-ablation}.
  • ...and 7 more figures