Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu

TL;DR

The paper addresses the challenge of removing the influence of unwanted data from LLMs while preserving utility. It criticizes gradient ascent for its uncontrolled divergence and NPO for its dependence on a reference model. It proposes SimNPO, a simple, reference-free, length-normalized preference optimization framework that better allocates unlearning effort across forget data and smooths gradient weights more effectively in the early stages of optimization. Empirical results on TOFU, MUSE, and WMDP show that SimNPO improves forget quality and utility over NPO and is robust to relearning attacks; a synthetic analysis based on mixtures of Markov chains supports these findings. The work provides both practical gains and theoretical intuition for simpler, safer LLM unlearning, with public code and extensive ablations. Overall, SimNPO offers a more reliable path to removing unwanted memorized content from LLMs without the reference model bias of NPO.

Abstract

This work studies the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences (e.g., copyrighted or harmful content) while preserving model utility. Despite the increasing demand for unlearning, a technically grounded optimization framework is lacking. Gradient ascent (GA)-type methods, though widely used, are suboptimal as they reverse the learning process without controlling optimization divergence (i.e., deviation from the pre-trained state), leading to risks of over-forgetting and potential model collapse. Negative preference optimization (NPO) has been proposed to address this issue and is considered one of the state-of-the-art LLM unlearning approaches. In this work, we revisit NPO and identify another critical issue: reference model bias. This bias arises from using the reference model (i.e., the model prior to unlearning) to evaluate the unlearning success, which can compromise NPO's effectiveness. Specifically, it leads to (a) uneven allocation of optimization power across forget data with varying difficulty levels and (b) ineffective gradient weight smoothing during the early stages of unlearning optimization. To overcome these challenges, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that "simplicity" in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We provide deeper insights into SimNPO's advantages through an analysis based on mixtures of Markov chains. Extensive experiments further validate SimNPO's efficacy on benchmarks like TOFU and MUSE, as well as its robustness against relearning attacks. Code is available at https://github.com/OPTML-Group/Unlearn-Simple.
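The contrast between a reference-based and a reference-free, length-normalized objective can be made concrete with a small sketch. The loss forms below are assumptions recalled from the commonly cited NPO and SimNPO formulations; they are not stated in this section, and the function names, the default `beta`, and the margin `gamma` are illustrative only. Inputs are per-example summed token log-probabilities $\log \pi(y|x)$ on forget data.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def npo_loss(logp_theta, logp_ref, beta=0.1):
    # Assumed NPO form: -(2/beta) * log sigmoid(-beta * log(pi_theta / pi_ref)).
    # The reference model's log-probability enters every term.
    log_ratio = np.asarray(logp_theta) - np.asarray(logp_ref)
    return float(np.mean(-(2.0 / beta) * log_sigmoid(-beta * log_ratio)))

def simnpo_loss(logp_theta, lengths, beta=0.1, gamma=0.0):
    # Assumed SimNPO form: the reference term is dropped and the
    # log-probability is normalized by the response length |y|.
    avg_logp = np.asarray(logp_theta) / np.asarray(lengths)
    return float(np.mean(-(2.0 / beta) * log_sigmoid(-beta * avg_logp - gamma)))
```

Under these assumed forms, the NPO loss takes the same value, $(2/\beta)\log 2$, for every forget example at the start of unlearning (when the model equals the reference), whereas the SimNPO loss already depends on each example's length-normalized likelihood.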

Paper Structure

This paper contains 29 sections, 11 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: (a) Systematic overview of an LLM (${\boldsymbol{\theta}}$) post-unlearning using the proposed SimNPO, compared to NPO \cite{zhang2024negative} and the reference model. (b) Truth ratio distribution of strongly-memorized forget data points and weakly-memorized data for NPO, SimNPO, and Retrain on the TOFU Forget05 dataset \cite{maini2024tofu} under LLaMA-2-chat 7B; see Sec. \ref{sec: NPO_limitations} for more details. As shown, SimNPO achieves better forget quality (FQ, the number after each method name) than NPO and exhibits a truth ratio distribution closer to Retrain. Note that FQ is a statistical measure quantifying the closeness between the truth ratio distribution of an unlearned model and that of Retrain (with FQ$=1$ representing optimal unlearning). (c) & (d) Experiment highlights on the TOFU Forget05 and MUSE News datasets \cite{shi2024muse}. Unlearning effectiveness is measured by FQ for TOFU and PrivLeak for MUSE, while utility preservation is evaluated using model utility for TOFU and KnowMem on retain data for MUSE (see Table \ref{tab:tasks_evaluations}). In both tasks, Retrain is the gold standard for unlearning.
  • Figure 2: Truth ratio distribution of short/long forget data for NPO, SimNPO, and Retrain on TOFU Forget05. The figure format follows Fig. \ref{fig: intro_fig}-(b).
  • Figure 3: Experimental evidence of ineffective weight smoothing and utility drop for NPO on TOFU Forget05. (a) NPO's gradient weights ($w_{\boldsymbol{\theta}}$) at epoch 1 vs. response length $|y|$. (b) Trajectory of $w_{\boldsymbol{\theta}}$ for NPO over unlearning epochs, where each box plot represents the distribution of gradient weights over forget samples. (c)-(d) Forget quality and model utility of NPO vs. epochs.
  • Figure 4: Gradient weight smoothing of NPO ($w_{\boldsymbol{\theta}}$) and SimNPO ($w_{\boldsymbol{\theta}}^\prime$) vs. forget data response length $|y|$ across different epochs (1, 2, 3, and 10) on TOFU Forget05. The Pearson correlation in the upper right corner of each panel indicates the relationship between gradient weight smoothing and response length. SimNPO's weights $w_{\boldsymbol{\theta}}^\prime$ have been rescaled (by $\times 10$) for ease of visualization.
  • Figure 5: Tradeoffs between forget quality (higher $\uparrow$ is better) and retain distance (lower $\downarrow$ is better) along the unlearning path of NPO and SimNPO in the synthetic experiments. The symbols $(\star, \bullet)$ near the $y$-axis of both figures indicate the performance of the retrained model on Forget1 and Forget2, respectively.
  • ...and 6 more figures
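
The gradient weights compared in Figures 3 and 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight expressions are derived from assumed NPO and SimNPO loss forms (with the SimNPO margin set to zero), the function names are hypothetical, and the default `beta` is arbitrary. Inputs are summed token log-probabilities and response lengths $|y|$ of forget samples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def npo_weight(logp_theta, logp_ref, beta=0.1):
    # Assumed w_theta = 2 * pi_theta^beta / (pi_theta^beta + pi_ref^beta),
    # written as 2 * sigmoid(beta * log(pi_theta / pi_ref)).
    return 2.0 * sigmoid(beta * (np.asarray(logp_theta) - np.asarray(logp_ref)))

def simnpo_weight(logp_theta, lengths, beta=0.1):
    # Assumed w'_theta = 2 * pi_theta^(beta/|y|) / (1 + pi_theta^(beta/|y|)) / |y|.
    lengths = np.asarray(lengths, dtype=float)
    return 2.0 * sigmoid(beta * np.asarray(logp_theta) / lengths) / lengths

def pearson(a, b):
    # Pearson correlation coefficient, as annotated in each panel of Figure 4.
    return float(np.corrcoef(a, b)[0, 1])
```

Under these assumptions, every NPO weight equals 1 when the model still coincides with the reference model, regardless of response length, while the SimNPO weight carries an explicit $1/|y|$ factor and therefore varies with length from the first step.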