Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning

Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, Sijia Liu

TL;DR

This work tackles the vulnerability of LLM unlearning to downstream fine-tuning, where erased knowledge can be unexpectedly revived. It proposes ILU, an invariant risk minimization-based unlearning framework that regularizes the unlearning objective to be invariant across fine-tuning environments, formalized as $\min_{\boldsymbol{\theta}} \ell_{u}(\boldsymbol{\theta}) + \lambda \sum_{i=1}^N \| \nabla_{w} \ell_i (w \circ \boldsymbol{\theta})|_{w=1} \|_2^2$. ILU can be applied on top of state-of-the-art unlearning methods like NPO and RMU and, using a single unrelated fine-tuning dataset, demonstrates enhanced robustness to unseen downstream tasks while preserving downstream utility. Task-vector analysis provides a mechanistic explanation for ILU’s resilience by showing more stable alignment of unlearning and post-finetuning directions in weight space. Experiments on WMDP and MUSE benchmarks show ILU consistently outperforms baselines in forget quality and robust accuracy, with favorable efficiency relative to more complex defenses, indicating practical applicability for safeguarding post-training updates against unintended relearning.
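
The invariance penalty in the objective above can be made concrete on a toy problem. The following is a minimal sketch, not the paper's implementation. It assumes $w$ is a scalar dummy multiplier (as in IRMv1), so that by the chain rule $\nabla_{w} \ell_i(w \circ \boldsymbol{\theta})|_{w=1} = \langle \nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta}), \boldsymbol{\theta} \rangle$. The quadratic losses, the function names (`quad_loss`, `invariance_penalty`, `ilu_objective`), and all numeric values are hypothetical stand-ins chosen for illustration.

```python
# Toy sketch of an ILU-style objective:
#   l_u(theta) + lam * sum_i (d/dw l_i(w * theta) at w = 1)^2
# Assumptions (not from the paper): scalar w, quadratic stand-in losses,
# parameters stored as plain Python lists.

def quad_loss(theta, target):
    """Stand-in loss: 0.5 * ||theta - target||^2."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, target))

def quad_grad(theta, target):
    """Gradient of quad_loss with respect to theta."""
    return [t - c for t, c in zip(theta, target)]

def invariance_penalty(theta, target):
    """Squared gradient with respect to the scalar w, evaluated at w = 1.

    Chain rule: d/dw l(w * theta) at w = 1 equals <grad l(theta), theta>.
    """
    g_w = sum(g * t for g, t in zip(quad_grad(theta, target), theta))
    return g_w ** 2

def ilu_objective(theta, unlearn_target, env_targets, lam):
    """Unlearning loss plus lam times the summed per-environment penalties."""
    penalty = sum(invariance_penalty(theta, c) for c in env_targets)
    return quad_loss(theta, unlearn_target) + lam * penalty

def finite_diff_gw(theta, target, eps=1e-6):
    """Central-difference check of d/dw l(w * theta) at w = 1."""
    up = quad_loss([(1 + eps) * t for t in theta], target)
    down = quad_loss([(1 - eps) * t for t in theta], target)
    return (up - down) / (2 * eps)

theta = [1.0, -2.0]
# Each target defines one hypothetical fine-tuning environment.
env_targets = [[0.5, 0.0], [1.0, -1.0]]
value = ilu_objective(theta, unlearn_target=[0.0, 0.0],
                      env_targets=env_targets, lam=0.1)
```

In a real setting, the per-environment losses would be fine-tuning losses of the LLM and the gradient would come from automatic differentiation. The finite-difference helper is there only to confirm the chain-rule identity on the toy loss.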

Abstract

Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs) by selectively removing targeted knowledge while preserving utility. However, current methods are highly sensitive to downstream fine-tuning, which can quickly recover forgotten information, even when the fine-tuning task is unrelated. To address this, we introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). Building on this principle, we propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness. Notably, ILU generalizes well to diverse fine-tuning tasks, even when trained using a single dataset. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP and MUSE benchmarks reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). In particular, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving fine-tuning performance.

Paper Structure

This paper contains 34 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Fine-tuning breaks existing unlearning methods. Performance evaluation of popular unlearning methods, NPO and RMU, applied to the LLM Zephyr-7b-beta for removing harmful knowledge on the WMDP dataset (Li et al., 2024). The effectiveness of unlearning is measured by the accuracy of the unlearned model on the WMDP-Bio evaluation set, with lower accuracy indicating better forgetting. Accordingly, we define 'forget quality' as '1 - evaluation accuracy', where a higher value means more effective unlearning. (Left: GSM8K fine-tuning) The trajectory of forget quality and fine-tuning accuracy is presented for various models, including NPO- or RMU-unlearned models and the original (non-unlearned) model, when subjected to downstream fine-tuning on the GSM8K dataset. The fine-tuning epoch number is indicated by the color, ranging from 0 (no fine-tuning) to the number required to match the performance of fully fine-tuning the original model (termed 'Original'). The dots with the same position (e.g., 1st, 2nd, 3rd) and color across NPO's and RMU's trajectories represent the same fine-tuning epoch number. (Right: AGNews fine-tuning) Same as the left plots, but for fine-tuning on the AGNews downstream dataset.
  • Figure 2: A single fine-tuning dataset suffices for enhancing unlearning robustness. Forget quality and fine-tuning accuracy of different unlearned models are presented against AGNews fine-tuning, following a similar setup and presentation format to Fig. \ref{fig: NPO_RMU_finetune_atk}.
  • Figure 3: Graceful generalization of ILU's robustness to unseen fine-tuning tasks during evaluation. Heatmap of forget quality on WMDP is presented for RMU and its ILU variants to demonstrate unlearning robustness under various unlearning training and downstream fine-tuning settings, where the unlearning setup is consistent with Fig. \ref{fig:RMU_ILU_multi}, and the forget quality in each cell is reported at the final fine-tuning epoch. Each row corresponds to an unlearning training approach, while each column represents an evaluation setting (e.g., a fine-tuning dataset or no fine-tuning).
  • Figure 4: Illustration of ILU's improved unlearning robustness compared to NPO through the relationships between unlearning and fine-tuning task vectors on the WMDP dataset.
  • Figure 5: Resilience of unlearning to downstream fine-tuning across different fine-tuning epochs. The unlearning setting follows Table \ref{tab: performance_comparison_avg}. The first row presents the comparison between NPO and NPO+ILU(GSM8K), while the second row corresponds to the comparison between RMU and RMU+ILU(GSM8K). Each sub-plot represents a specific downstream fine-tuning dataset, with the left y-axis measuring FQ (forget quality) and the right y-axis measuring FA (fine-tuning accuracy). The x-axis denotes the fine-tuning epoch, with the maximum number set to ensure convergence and satisfactory fine-tuning performance for each downstream dataset.
  • ...and 3 more figures
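
Two quantities from the captions above lend themselves to a short illustration: forget quality, defined in Figure 1 as 1 minus evaluation accuracy, and the relationship between unlearning and fine-tuning task vectors mentioned in Figure 4. The sketch below is a hedged toy example. It assumes a task vector is the difference between parameters after and before an update, and it uses cosine similarity as one possible alignment measure; the captions do not state which measure the paper uses. All function names and parameter values are hypothetical.

```python
import math

def forget_quality(eval_accuracy):
    """Forget quality as defined in the Figure 1 caption: 1 - evaluation accuracy."""
    if not 0.0 <= eval_accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return 1.0 - eval_accuracy

def task_vector(theta_after, theta_before):
    """Assumed definition: parameter change caused by an update."""
    return [a - b for a, b in zip(theta_after, theta_before)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical two-parameter "models".
theta_original = [1.0, 1.0]
theta_unlearned = [0.0, 1.0]       # after unlearning
theta_ft_orthogonal = [0.0, 2.0]   # fine-tuning moves in an unrelated direction
theta_ft_reverting = [1.0, 1.0]    # fine-tuning moves back to the original

tau_unlearn = task_vector(theta_unlearned, theta_original)
tau_ft_orthogonal = task_vector(theta_ft_orthogonal, theta_unlearned)
tau_ft_reverting = task_vector(theta_ft_reverting, theta_unlearned)

fq = forget_quality(0.25)
cos_orthogonal = cosine(tau_unlearn, tau_ft_orthogonal)
cos_reverting = cosine(tau_unlearn, tau_ft_reverting)
```

In this toy setup, a cosine of -1 means the fine-tuning update points directly opposite to the unlearning update, while a cosine of 0 means the two updates are orthogonal in weight space.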