Erase or Hide? Suppressing Spurious Unlearning Neurons for Robust Unlearning

Nakyeong Yang, Dong-Kyum Kim, Jea Kwon, Minsung Kim, Kyomin Jung, Meeyoung Cha

TL;DR

This paper introduces Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. The findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.

Abstract

Large language models trained on web-scale data can memorize private or sensitive knowledge, raising significant privacy risks. Although some unlearning methods mitigate these risks, they remain vulnerable to "relearning" during subsequent training, allowing a substantial portion of forgotten knowledge to resurface. In this paper, we show that widely used unlearning methods cause shallow alignment: instead of faithfully erasing target knowledge, they generate spurious unlearning neurons that amplify negative influence to hide it. To overcome this limitation, we introduce Ssiuu, a new class of unlearning methods that employs attribution-guided regularization to prevent spurious negative influence and faithfully remove target knowledge. Experimental results confirm that our method reliably erases target knowledge and outperforms strong baselines across two practical retraining scenarios: (1) adversarial injection of private data, and (2) benign attack using an instruction-following benchmark. Our findings highlight the necessity of robust and faithful unlearning methods for safe deployment of language models.

Paper Structure

This paper contains 36 sections, 14 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Experiments on Retraining Attacks with FaithUn. The accuracy of the unlearned models on the forget set before the attacks is 0%.
  • Figure 2: Influence Variations after Unlearning. After unlearning, most models show that negative influence variations are substantially greater than positive influence variations. In Figure 3-(b), the X-axis denotes the number of accumulated neurons sorted by their scores, and the Y-axis indicates accumulated influence variations. The solid and dotted lines express negative and positive ones, respectively. Figure 3-(c) shows variations extracted from Figure 3-(b) over 100 neurons.
  • Figure 3: Analyzing Excessive Knowledge Removal via Logit Lens. The X-axis and Y-axis correspond to layer indices and accuracy, respectively. The blue dotted line represents the random-choice baseline (binary classification). GD tends to excessively unlearn target knowledge, whereas Ssiuu adequately unlearns it to the random-choice level.
  • Figure 4: Influence Variation for Each Module and Layer. We plot positive and negative influence variations of GD and Ssiuu for each module and layer. X-axis and Y-axis correspond to layer indices and module type, respectively. The color scale indicates the average variation in influence for the top-$100$ neurons in each module.
  • Figure 5: Deeper Investigations into Influence Distributions. We present the influence (attribution) changes after the harmful attack ($p=0.1$). Panel (a) illustrates the attributions of the original model, unlearned models, and models after the attack. Panel (b) presents the correlation between attributions before and after the attacks. While models trained with other methods exhibit high variability, our method yields relatively consistent distributions with a strong correlation.
  • ...and 1 more figures
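
The accumulated influence variations described in the Figure 2 caption (neurons sorted by score, with variations summed cumulatively and split into negative and positive curves) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `accumulated_variations`, the choice to sort neurons by the size of their variation, and the toy attribution values are all assumptions made here for demonstration.

```python
from itertools import accumulate


def accumulated_variations(before, after):
    """Return cumulative negative and positive attribution changes.

    `before` and `after` are per-neuron attribution scores (hypothetical
    inputs). Sorting by variation size is an assumption of this sketch.
    """
    # Per-neuron influence variation: attribution after minus before unlearning.
    deltas = [a - b for b, a in zip(before, after)]
    # Negative variations, most negative first.
    negative = sorted(d for d in deltas if d < 0)
    # Positive variations, largest first.
    positive = sorted((d for d in deltas if d > 0), reverse=True)
    # Running sums give the two curves plotted against the neuron count.
    return list(accumulate(negative)), list(accumulate(positive))


# Toy attribution scores for five neurons (not data from the paper).
before = [0.5, 0.25, 0.0, -0.25, 1.0]
after = [0.0, 0.5, -1.0, -0.25, 0.5]
neg_curve, pos_curve = accumulated_variations(before, after)
```

In this toy setting the negative curve falls much further than the positive curve rises, which is the kind of asymmetry the Figure 2 caption reports for most unlearned models.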