ROKA: Robust Knowledge Unlearning against Adversaries

Jinmyeong Shin, Joshua Tapia, Nicholas Ferreira, Gabriel Diaz, Moayed Daneshyari, Hyeran Jeon

TL;DR

This study introduces a new unlearning-induced attack model, the indirect unlearning attack, which requires no data manipulation and instead exploits the consequences of knowledge contamination to degrade model accuracy on security-critical predictions. It also proposes ROKA, a robust unlearning strategy centered on Neural Healing.

Abstract

Machine unlearning is critical for data privacy, yet existing methods often cause Knowledge Contamination by unintentionally damaging related knowledge. This degraded model performance after unlearning has recently been leveraged for new inference and backdoor attacks. Most studies design adversarial unlearning requests that require poisoning or duplicating training data. In this study, we introduce a new unlearning-induced attack model, the indirect unlearning attack, which requires no data manipulation and instead exploits the consequences of knowledge contamination to perturb model accuracy on security-critical predictions. To mitigate this attack, we introduce a theoretical framework that models neural networks as Neural Knowledge Systems. Based on this framework, we propose ROKA, a robust unlearning strategy centered on Neural Healing. Unlike conventional unlearning methods that only destroy information, ROKA constructively rebalances the model by nullifying the influence of forgotten data while strengthening its conceptual neighbors. To the best of our knowledge, our work is the first to provide a theoretical guarantee for knowledge preservation during unlearning. Evaluations on various large models, including vision transformers, multi-modal models, and large language models, show that ROKA effectively unlearns targets while preserving, or even enhancing, the accuracy on retained data, thereby mitigating indirect unlearning attacks.
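
As a rough illustration of the attack surface described above, the sketch below picks the retained class whose accuracy drops the most after an unlearning request, following the definition of $C_{target}^{adv}$ given in the Figure 2 caption. The function name `pick_indirect_target`, the class labels, and all accuracy values are hypothetical and made up purely for illustration; they are not results from the paper.

```python
def pick_indirect_target(acc_before, acc_after, unlearned_class):
    """Return (class, drop) for the retained class with the largest accuracy drop.

    acc_before / acc_after map class label -> accuracy in [0, 1], measured
    before and after the unlearning request. The unlearned class itself is
    excluded, since its accuracy is expected to fall.
    """
    drops = {
        cls: acc_before[cls] - acc_after[cls]
        for cls in acc_before
        if cls != unlearned_class
    }
    target = max(drops, key=drops.get)
    return target, drops[target]


# Hypothetical per-class accuracies (illustrative values only).
acc_before = {"class_a": 0.95, "class_b": 0.96, "class_c": 0.94}
acc_after = {"class_a": 0.02, "class_b": 0.95, "class_c": 0.61}

target, drop = pick_indirect_target(acc_before, acc_after, unlearned_class="class_a")
print(target, round(drop, 2))
```

An adversary in this threat model would look for an unlearning request whose most damaged retained class happens to be security-critical; a defender can run the same measurement to audit an unlearning method for knowledge contamination.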

Paper Structure

This paper contains 34 sections, 15 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustration of our proposed Indirect Unlearning Attack. This figure demonstrates how an unlearning request for one subject can compromise the security of another. (a) Initially, a hypothetical face recognition system correctly approves an authorized user (Gaby Espino) while denying an unauthorized one (Rick Astley). (b) An adversary then submits a request to unlearn a seemingly unrelated individual (Kate Nash). (c) After a conventional unlearning process, the model's knowledge is contaminated: it still correctly approves Gaby Espino, but its ability to recognize Rick Astley is degraded, causing it to incorrectly grant him access and thus compromising the system's security.
  • Figure 2: Imbalanced prediction impacts from unlearning: $C_{unlean}^{adv}$ is the unlearned data class, and $C_{target}^{adv}$ is the data class that exhibits the most severe accuracy drop among those that were not unlearned. Our proposed indirect unlearning attack requests unlearning of $C_{unlean}^{adv}$ in order to compromise the accuracy of $C_{target}^{adv}$, where $C_{target}^{adv}$ is assumed to be a security-critical class.
  • Figure 3: Contribution Re-allocation Procedure: When unlearning knowledge encoded in $K_{l,2}$, the nullified target information weakens the contribution of the next-layer neurons, which results in indirect unlearning of the knowledge delivered by sibling neurons. Our proposed Neural Healing reallocates the original contribution of $K_{l,2}$ to the sibling neurons, $K_{l,1}$ and $K_{l,3}$, to preserve the total contribution, so that the reallocated extra contribution can strengthen (or at least maintain) their knowledge.
  • Figure 4: Unlearning Stability Comparison on CIFAR-100 and Tiny-ImageNet: The plots show the accuracy trends for the forget set (dashed lines) and the retain set (solid lines) over 200 unlearning iterations. GA denotes gradient-ascent-based unlearning, and STU and SNTU denote ROKA using stochastic targeted unlearning and stochastic non-targeted unlearning, respectively.
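
The contribution re-allocation described in the Figure 3 caption can be illustrated with a toy sketch: the contribution of the unlearned unit ($K_{l,2}$) is set to zero and handed to its siblings ($K_{l,1}$ and $K_{l,3}$) so that the total contribution is unchanged. The proportional split used here is an assumption chosen for simplicity, not necessarily the rule ROKA applies, and the function name `reallocate_contribution` and the numeric contributions are hypothetical.

```python
def reallocate_contribution(contributions, target_idx):
    """Nullify one unit's contribution and spread it over its siblings.

    Assumption: contributions are non-negative scalars, and the removed
    amount is split in proportion to each sibling's current contribution,
    which keeps the sum of contributions unchanged.
    """
    removed = contributions[target_idx]
    sibling_total = sum(c for i, c in enumerate(contributions) if i != target_idx)
    if sibling_total == 0:
        raise ValueError("no sibling contribution available to scale")
    scale = (sibling_total + removed) / sibling_total
    return [
        0.0 if i == target_idx else c * scale
        for i, c in enumerate(contributions)
    ]


# Hypothetical contributions of K_{l,1}, K_{l,2}, K_{l,3} to a next-layer neuron.
contributions = [0.2, 0.5, 0.3]
healed = reallocate_contribution(contributions, target_idx=1)
print(healed, sum(healed))
```

In this toy setting each sibling ends up with a larger contribution than before, which mirrors the caption's point that the siblings' knowledge is maintained or strengthened rather than weakened as a side effect of nullifying the target.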

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5