Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Hongbang Yuan, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR

This work tackles the robustness problem of unlearning in large language models. It first exposes vulnerabilities through Dynamic Unlearning Attack (DUA), which learns universal adversarial suffixes to resurrect forgotten knowledge across scenarios. It then introduces Latent Adversarial Unlearning (LAU), a universal min-max framework that perturbs latent representations in an attack stage and reuses those perturbations in a defense stage to harden the unlearning process, yielding two robust methods, AdvGA and AdvNPO. Across the RWKU and MUSE benchmarks and several LLM variants, LAU-augmented methods improve forgetting effectiveness by over 53.5% while reducing neighboring knowledge by less than 11.6% and leaving general capabilities almost unchanged. The paper also analyzes how the choice of perturbation layer and the number of inner optimization steps shape robustness.

Abstract

Large language models (LLMs) have achieved success in many fields but are still troubled by problematic content in their training corpora. LLM unlearning aims to reduce the influence of such content and avoid undesirable behaviours. However, existing unlearning methods remain vulnerable to adversarial queries, and the unlearned knowledge resurfaces under manually designed attack queries. As part of a red-team effort to proactively assess the vulnerabilities of unlearned models, we design Dynamic Unlearning Attack (DUA), a dynamic and automated framework to attack these models and evaluate their robustness. It optimizes adversarial suffixes to reintroduce the unlearned knowledge in various scenarios. We find that unlearned knowledge can be recovered in $55.2\%$ of the questions, even without revealing the unlearned model's parameters. In response to this vulnerability, we propose Latent Adversarial Unlearning (LAU), a universal framework that effectively enhances the robustness of the unlearning process. It formulates the unlearning process as a min-max optimization problem and resolves it through two stages: an attack stage, where perturbation vectors are trained and added to the latent space of LLMs to recover the unlearned knowledge, and a defense stage, where the previously trained perturbation vectors are used to enhance the unlearned model's robustness. With our LAU framework, we obtain two robust unlearning methods, AdvGA and AdvNPO. We conduct extensive experiments across multiple unlearning benchmarks and various models, and demonstrate that they improve the unlearning effectiveness by over $53.5\%$, cause less than an $11.6\%$ reduction in neighboring knowledge, and have almost no impact on the model's general capabilities.
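
The two-stage min-max procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a tiny two-layer linear model in NumPy in place of an LLM, a cross-entropy loss on a single "forget" target, an L2-bounded perturbation added to the latent vector, and a defense step that updates only the readout matrix by gradient ascent. All names (W1, W2, attack, defend, eps) are hypothetical.

```python
import numpy as np

def forget_loss_and_grads(W1, W2, x, y, delta):
    """Cross-entropy of forget target y when delta is added to the latent vector."""
    u = W1 @ x + delta                      # perturbed latent representation
    z = W2 @ u                              # logits
    m = z.max()
    log_norm = m + np.log(np.exp(z - m).sum())
    loss = log_norm - z[y]                  # equals -log softmax(z)[y]
    dz = np.exp(z - log_norm)               # softmax probabilities
    dz[y] -= 1.0                            # gradient of the loss w.r.t. the logits
    grad_delta = W2.T @ dz                  # gradient w.r.t. the perturbation
    grad_W2 = np.outer(dz, u)               # gradient w.r.t. the readout matrix
    return loss, grad_delta, grad_W2

def attack(W1, W2, x, y, steps=10, lr=0.1, eps=1.0):
    """Attack stage (toy): find a bounded delta that lowers the forget loss,
    i.e. that helps the model recover the target it should have forgotten."""
    delta = np.zeros(W1.shape[0])
    for _ in range(steps):
        _, grad_delta, _ = forget_loss_and_grads(W1, W2, x, y, delta)
        delta = delta - lr * grad_delta     # descend on the forget loss
        norm = np.linalg.norm(delta)
        if norm > eps:                      # project back onto the L2 ball
            delta = delta * (eps / norm)
    return delta

def defend(W1, W2, x, y, delta, lr=0.05):
    """Defense stage (toy): with delta held fixed, take one gradient-ascent
    step on the forget loss. Only W2 is updated here to keep the sketch short."""
    _, _, grad_W2 = forget_loss_and_grads(W1, W2, x, y, delta)
    return W2 + lr * grad_W2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W1 = 0.5 * rng.standard_normal((4, 3))
    W2 = 0.5 * rng.standard_normal((5, 4))
    x, y = rng.standard_normal(3), 2
    for _ in range(20):                     # alternate attack and defense
        delta = attack(W1, W2, x, y)
        W2 = defend(W1, W2, x, y, delta)
    print("forget loss under attack:", forget_loss_and_grads(W1, W2, x, y, delta)[0])
```

Alternating the two functions mirrors the attack and defense stages: the inner loop searches for the perturbation that best recovers the target, and the outer step unlearns against that perturbation rather than against the clean input alone.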

Paper Structure (44 sections, 10 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: Our work focuses on assessing the robustness of unlearned models by training adversarial suffixes and enhancing the robustness of the unlearning process through latent adversarial unlearning. In this figure, we show (a) an example of unlearning J.K. Rowling; (b) an example of static and dynamic unlearning attacks; (c) the framework of latent adversarial unlearning.
  • Figure 2: Experimental results of our dynamic attack framework. We report the ROUGE-L recall score (%) in this figure.
  • Figure 3: Influence of the perturbed layers and the inner optimization steps. We report the ROUGE-L recall score (%).
  • Figure 4: Robustness evaluation of AdvNPO. We report the performance change ($\Delta$) in terms of the ROUGE-L recall score (%) compared to the scenario without attack.
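
Figures 2 to 4 report ROUGE-L recall. As a rough illustration of that metric, the sketch below computes the length of the longest common subsequence (LCS) between a reference answer and a model output and divides it by the reference length. The lowercasing and whitespace tokenization are assumptions made here for brevity, and the function names are hypothetical; the paper's evaluation code may tokenize differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length divided by the number of reference tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref:
        return 0.0  # convention chosen here for an empty reference
    return lcs_length(ref, cand) / len(ref)

if __name__ == "__main__":
    score = rouge_l_recall("the cat sat on the mat", "the cat is on the mat")
    print(f"ROUGE-L recall: {100 * score:.1f}%")
```

Under this metric a lower score on the forget set means less of the reference answer is reproduced, so a successful attack shows up as an increase in the score.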