Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys

TL;DR

The paper tackles the problem of robust, irreversible unlearning in language models by introducing MUDMAN, a pipeline that leverages meta-learning, Disruption Masking, and gradient normalization to prevent the re-emergence of dangerous capabilities. Through extensive experiments across multiple models and datasets, MUDMAN demonstrates substantial improvements over the TAR baseline and emphasizes selective intervention on model components to preserve retention. The work combines rigorous ablations with practical considerations, showing that targeted, masked, and normalized updates can yield more durable unlearning with modest overhead. These findings move toward safer deployment of LLMs by reducing the likelihood that harmful capabilities can be recovered via subsequent learning or adversarial prompts.

Abstract

Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify the ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and also confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at preventing the recovery of dangerous capabilities. MUDMAN outperforms the prior TAR method by 40%, setting a new state-of-the-art for robust unlearning.
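
The masking and normalization steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes gradients are flat lists of floats, that normalization means dividing by the global L2 norm (the abstract does not say which norm is used), and that an entry with a zero gradient on either side counts as a sign mismatch. The function names `normalize` and `disruption_mask` are hypothetical.

```python
import math


def normalize(grad, eps=1e-12):
    """Scale a gradient to unit L2 norm (assumed normalization scheme)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + eps) for g in grad]


def disruption_mask(unlearn_grad, retain_grad):
    """Keep an unlearning-gradient entry only where its sign matches the
    retaining gradient's sign; zero it out otherwise."""
    return [
        u if u * r > 0 else 0.0  # assumption: a zero on either side is a mismatch
        for u, r in zip(unlearn_grad, retain_grad)
    ]


# Toy gradients for four weights (illustrative values only).
unlearn_grad = [3.0, -4.0, 0.0, 12.0]
retain_grad = [0.5, 0.2, 0.1, 1.0]

# Normalize first, then mask.
masked = disruption_mask(normalize(unlearn_grad), retain_grad)
print(masked)
```

In this toy case the second weight is masked because its unlearning gradient is negative while its retaining gradient is positive, so only the first and last weights would be updated.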

Paper Structure

This paper contains 57 sections, 3 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: The MUDMAN pipeline. When training the base model, we periodically fork out an adversary model and train it to perform well on the forget set (orange gradients). Using this adversarial model, we calculate the unlearning gradients (red) to be applied to the base model. Before applying the unlearning gradients, we normalize them and perform Disruption Masking, which zeroes out each weight's unlearning gradient if its sign differs from its retaining gradient (green).
  • Figure 2: Disruption Masking and its training dynamics. Left: regular unlearning runs where we maximize the forget loss while trying to minimize the retain loss. For each weight, we color its unlearning update green if it improves the retain loss, and red if it harms it. Right: Disruption Masking filters out all these harmful updates. This manages to raise the forget loss without impacting the retain loss.
  • Figure 3: Ablation study of MUDMAN. To establish that each part of MUDMAN is indeed necessary, we disable them one by one and measure unlearning performance in terms of the forget loss. We also compare to the state-of-the-art TAR method. The baseline is the loss level with no unlearning applied, but after the same relearning as the other methods underwent. Each bar corresponds to one Optuna hyperparameter search. The reported loss is the average of the last 30 valid trials in that search, and error bars are their standard error. The results show that Disruption Masking makes a huge difference (orange vs red), and that it accounts for most of the improvement over TAR. Meta-learning and unlearning gradient normalization also help, but not in every setup: ablating them sometimes yields slightly better performance, though the difference is not significant. In MUDMAN we use negative cross-entropy as the unlearning loss, due to its top performance. As discussed in the module selection section, in each experiment we focus on training the first layers of each MLP.
  • Figure 4: Accuracy on WMDP-Bio after unlearning and relearning on Pile-Bio. The base level on the right (45.5%) is the accuracy with no unlearning applied, but after relearning. Reported accuracy is the average of the last 60 valid trials in each hyperparameter search. We only show Llama, because smaller models had near-random accuracy on WMDP-Bio. Here, we use MUDMAN with negative entropy (Tamirisa et al., 2024) as the unlearning loss. Similarly to Figure 3, using Disruption Masking is crucial (orange vs red) and it accounts for most of the improvement over TAR. There are also clear improvements from meta-learning and unlearning gradient normalization.
  • Figure 5: Performance comparison across different target module configurations for unlearning. Higher values indicate better unlearning effectiveness while maintaining model capabilities. The baseline is the loss without any unlearning, but with the same relearning stage as all the methods underwent. Gate projection (and, in the case of Pythia, its equivalent: the first MLP layer) helps most consistently. Other potential candidates for intervention are the V, O and up projections. The Q, K and down projections disrupt retain performance so much that it is better to omit them. In the case of Pythia, the Q, K and V matrices are integrated into one module, so we were not able to analyze them separately.
  • ...and 5 more figures
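
The reporting rule in the Figure 3 caption (average of the last 30 valid trials, with standard error as the error bar) can be sketched as follows. This is a sketch under stated assumptions, not the paper's evaluation code: the helper `summarize_trials` is hypothetical, the input list is assumed to already contain only valid trials in chronological order, the standard error is assumed to use the sample standard deviation, and the loss values are made up for illustration.

```python
import math
import statistics


def summarize_trials(losses, last_n=30):
    """Return (mean, standard error) over the last `last_n` trial losses."""
    tail = losses[-last_n:]
    mean = statistics.fmean(tail)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(tail) / math.sqrt(len(tail))
    return mean, sem


# Hypothetical forget losses from one hyperparameter search (50 valid trials).
trial_losses = [2.0 + 0.01 * i for i in range(50)]

mean, sem = summarize_trials(trial_losses)
print(f"reported loss: {mean:.3f} +/- {sem:.3f}")
```

For the accuracy numbers in Figure 4, the same helper would be called with `last_n=60`.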