Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu

TL;DR

The paper investigates how the optimizer used for LLM unlearning influences robustness to post-unlearning weight perturbations, revealing that downgrading optimizers (e.g., zeroth-order estimation or gradient-sign-based compression) can improve resistance to relearning and quantization. It introduces the concept of optimizer grade and demonstrates that zeroth-order or compressed-first-order methods can steer optimization into basins more resilient to perturbations, though with trade-offs in unlearning precision. To balance forgetting efficacy and robustness, the authors propose a FO–ZO hybrid optimizer that alternates between first-order and zeroth-order updates, achieving strong forgetting while enhancing resilience across MUSE, WMDP, and TOFU benchmarks. The findings offer a new design principle for robust LLM unlearning, with practical impact on privacy and safety by reducing the vulnerability of forgotten information to post-unlearning manipulations.

Abstract

Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the 'grade' of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
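The optimizer "grades" described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the quadratic `loss` stands in for an unlearning objective, and the function names, the two-point Gaussian estimator, the query count, and the strict every-other-step alternation in `hybrid_fo_zo` are all assumptions made for illustration.

```python
import random

def loss(w):
    # Toy quadratic objective standing in for an unlearning loss (assumption).
    return sum((wi - 1.0) ** 2 for wi in w)

def fo_grad(w):
    # Exact first-order gradient of the toy loss.
    return [2.0 * (wi - 1.0) for wi in w]

def zo_grad(f, w, mu=1e-3, num_queries=8, rng=random):
    # Zeroth-order (gradient-free) estimate using only loss evaluations:
    # average of (f(w + mu*u) - f(w - mu*u)) / (2*mu) * u over Gaussian u.
    est = [0.0] * len(w)
    for _ in range(num_queries):
        u = [rng.gauss(0.0, 1.0) for _ in w]
        w_plus = [wi + mu * ui for wi, ui in zip(w, u)]
        w_minus = [wi - mu * ui for wi, ui in zip(w, u)]
        scale = (f(w_plus) - f(w_minus)) / (2.0 * mu)
        est = [e + scale * ui / num_queries for e, ui in zip(est, u)]
    return est

def sign_compress(g):
    # Gradient-sign compression: keep only the sign of each coordinate.
    return [(x > 0) - (x < 0) for x in g]

def hybrid_fo_zo(w, steps=200, lr=0.05, seed=0):
    # Hypothetical schedule: even steps use the FO gradient, odd steps the ZO estimate.
    rng = random.Random(seed)
    for t in range(steps):
        g = fo_grad(w) if t % 2 == 0 else zo_grad(loss, w, rng=rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

if __name__ == "__main__":
    w_final = hybrid_fo_zo([0.0, 0.0, 0.0])
    print("final loss:", loss(w_final))
    print("sign of initial gradient:", sign_compress(fo_grad([0.0, 2.0, 1.0])))
```

In this sketch the ZO branch never calls `fo_grad`; it sees the objective only through perturbed evaluations, which is the link to randomized smoothing mentioned in the abstract. A sign-based optimizer would apply `sign_compress` to the gradient before the update.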

Paper Structure

This paper contains 19 sections, 5 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Unlearning performance under 4-bit weight quantization using NPO on MUSE with different optimizers (Sophia, Adam, 8-bit Adam, and 1-bit Adam). Performance is measured by unlearning effectiveness (VerbMem and KnowMem on $\mathcal{D}_{\mathrm{f}}$, left plots in each sub-figure) and utility (KnowMem on $\mathcal{D}_{\mathrm{r}}$, right plots in each sub-figure). "Pre-unlearn" denotes the target model before unlearning is applied; "before Q" (circle) and "after Q" (diamond) denote the unlearned model before and after 4-bit weight quantization. (a) Unlearning on MUSE-News. (b) Unlearning on MUSE-Books.
  • Figure 2: On MUSE-Books, (a–b): Unlearning performance under 4-bit weight quantization using GradDiff and NPO with different optimizers (Adam, signSGD, signAdam, (FO) RS, and the ZO method). The figure format is consistent with Fig. 1. (c–d): Unlearning performance after 100 relearning steps ("Relearn100"), using GradDiff and NPO with different optimizers.
  • Figure 3: Linear mode connectivity (LMC) between downgraded optimizers (signSGD, signAdam, RS, and ZO) and Adam on MUSE-Books, using NPO.
  • Figure 4: (a–b): Unlearning performance before and after 4-bit quantization on MUSE-Books using GradDiff and NPO with optimizers Adam, SAM, signAdam, and Hybrid FO–ZO. (c–d): GradDiff and NPO on MUSE-Books under different optimizers against "Relearn100" (100 relearning steps). The figure format follows Fig. 2.
  • Figure 5: Unlearning performance and relearning robustness of RMU and NPO on WMDP-Bio using different optimizers (Adam, signAdam, ZO, SAM, and Hybrid). Relearning is conducted by fine-tuning the unlearned model on 40 forget data samples across multiple epochs. (a) Unlearning effectiveness and utility retention of RMU without relearning; (b) NPO without relearning; (c) RMU across different relearning epochs; (d) NPO across different relearning epochs.
  • ...and 6 more figures
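
The captions above refer to two weight-space probes: perturbing an unlearned model by 4-bit quantization, and checking linear mode connectivity (LMC) between two solutions. A minimal sketch of both on plain lists of weights follows. The uniform round-to-nearest scheme, the function names, and the `metric` callback are assumptions for illustration; the quantizer used in the paper's experiments may differ.

```python
def quantize_rtn(weights, bits=4):
    # Uniform round-to-nearest quantization over [min, max] (assumed scheme).
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

def interpolate(w_a, w_b, alpha):
    # Point on the straight line between two weight vectors.
    return [(1 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]

def lmc_curve(metric, w_a, w_b, num_points=5):
    # Evaluate a user-supplied metric along the linear path from w_a to w_b.
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(a, metric(interpolate(w_a, w_b, a))) for a in alphas]

if __name__ == "__main__":
    w = [0.0, 0.3, 0.31, 1.0]
    print("quantized:", quantize_rtn(w))
    toy_metric = lambda v: sum(x * x for x in v)  # hypothetical stand-in metric
    print("LMC curve:", lmc_curve(toy_metric, [0.0, 0.0], [2.0, 4.0]))
```

A flat metric curve along the path suggests the two solutions share a basin, while a bump or barrier suggests they sit in different ones.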