Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning
Changsheng Wang, Yihua Zhang, Jinghan Jia, Parikshit Ram, Dennis Wei, Yuguang Yao, Soumyadeep Pal, Nathalie Baracaldo, Sijia Liu
TL;DR
This work tackles the vulnerability of LLM unlearning to downstream fine-tuning, where erased knowledge can be unexpectedly revived. It proposes ILU, an invariant risk minimization-based unlearning framework that regularizes the unlearning objective to be invariant across fine-tuning environments, formalized as $\min_{\boldsymbol{\theta}} \ell_{u}(\boldsymbol{\theta}) + \lambda \sum_{i=1}^N \| \nabla_{w} \ell_i (w \circ \boldsymbol{\theta})|_{w=1} \|_2^2$. ILU can be applied on top of state-of-the-art unlearning methods like NPO and RMU and, using a single unrelated fine-tuning dataset, demonstrates enhanced robustness to unseen downstream tasks while preserving downstream utility. Task-vector analysis provides a mechanistic explanation for ILU’s resilience by showing more stable alignment of unlearning and post-finetuning directions in weight space. Experiments on WMDP and MUSE benchmarks show ILU consistently outperforms baselines in forget quality and robust accuracy, with favorable efficiency relative to more complex defenses, indicating practical applicability for safeguarding post-training updates against unintended relearning.
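To make the regularized objective above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $w$ is an elementwise scaling vector, so that by the chain rule $\nabla_{w} \ell_i(w \circ \boldsymbol{\theta})|_{w=1} = \boldsymbol{\theta} \circ \nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta})$, and it substitutes a toy linear model with squared-error loss for each fine-tuning environment. The function names `squared_error_grad`, `invariance_penalty`, and `ilu_objective` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
# Hypothetical environment format: (inputs, targets) for a toy linear model.
Environment = Tuple[Sequence[Sequence[float]], Sequence[float]]


def squared_error_grad(theta: Sequence[float],
                       xs: Sequence[Sequence[float]],
                       ys: Sequence[float]) -> Vector:
    """Gradient of the mean squared error of a linear model w.r.t. theta.

    This toy loss stands in for the fine-tuning loss l_i (an assumption).
    """
    n = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        residual = sum(xj * tj for xj, tj in zip(x, theta)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * residual * xj / n
    return grad


def invariance_penalty(theta: Sequence[float],
                       env_grads: Sequence[Sequence[float]]) -> float:
    """Sum over environments of ||theta * grad_i||_2^2 (elementwise product).

    With an elementwise dummy scaling w, the gradient of l_i(w * theta)
    w.r.t. w at w = 1 equals theta * grad_theta l_i(theta).
    """
    total = 0.0
    for grad in env_grads:
        total += sum((tj * gj) ** 2 for tj, gj in zip(theta, grad))
    return total


def ilu_objective(unlearn_loss: float,
                  theta: Sequence[float],
                  envs: Sequence[Environment],
                  lam: float) -> float:
    """Unlearning loss plus lam times the invariance penalty."""
    env_grads = [squared_error_grad(theta, xs, ys) for xs, ys in envs]
    return unlearn_loss + lam * invariance_penalty(theta, env_grads)
```

The penalty is zero exactly when every environment's loss is stationary with respect to the dummy scaling at $w = 1$, so minimizing it pushes the unlearned parameters toward a point that each fine-tuning environment has little first-order incentive to move.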
Abstract
Machine unlearning offers a promising solution to privacy and safety concerns in large language models (LLMs) by selectively removing targeted knowledge while preserving utility. However, current methods are highly sensitive to downstream fine-tuning, which can quickly recover forgotten information, even when the fine-tuning task is unrelated. To address this, we introduce invariance into unlearning for the first time, inspired by invariant risk minimization (IRM). Building on this principle, we propose invariant LLM unlearning (ILU), a regularization-based framework that enhances robustness. Notably, ILU generalizes well to diverse fine-tuning tasks, even when trained using a single dataset. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP and MUSE benchmarks reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). In particular, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.
