From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou

TL;DR

The paper examines how unlearning methods for vision models can be defeated by relearning attacks, which reintroduce forgotten information through fine-tuning. Working in a controlled, example-level setting, it shows that forget-set accuracy can rebound to nearly 100% even when relearning uses only the retain set, which indicates that the unlearning was incomplete. A weight-space analysis yields two diagnostics, $L_2$ weight-space distance and Linear Mode Connectivity between the pretrained and the unlearned model, that predict tamper-resistance and guide the design of a new class of methods. The proposed approaches, notably Weight Distortion and Weight Dist Reg, push the unlearned model away from the pretrained one and establish barriers between the two in weight space, achieving state-of-the-art tamper-resistance at some cost to test accuracy. The work underscores the practical importance of tamper-resistance in unlearning and provides actionable strategies for more robust deployment.

Abstract

Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
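
As an illustration of the first weight-space property named in the abstract, the following is a minimal sketch of an $L_2$ distance between the parameters of an original and an unlearned model. It is not the authors' code. The function name `l2_weight_distance`, the dictionary-of-lists model representation, and the PyTorch-style buffer suffixes used to skip batch-norm statistics are all assumptions made for this example. The skip itself mirrors the Figure 5 caption, which compares parameters only.

```python
import math

# Assumption: batch-norm statistics are identified by PyTorch-style buffer names.
BN_STAT_SUFFIXES = ("running_mean", "running_var", "num_batches_tracked")


def l2_weight_distance(original, unlearned, skip_suffixes=BN_STAT_SUFFIXES):
    """L2 norm of the parameter difference between two models.

    Both arguments map a parameter name to a flat list of floats
    (a hypothetical stand-in for a real framework's state dict).
    """
    squared = 0.0
    for name, w_orig in original.items():
        if name.endswith(skip_suffixes):
            continue  # compare learned parameters only, not batch-norm statistics
        w_unl = unlearned[name]
        if len(w_orig) != len(w_unl):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        squared += sum((a - b) ** 2 for a, b in zip(w_orig, w_unl))
    return math.sqrt(squared)


# Toy usage with made-up numbers; the running_mean entry is ignored.
original = {"conv.weight": [1.0, 2.0], "bn.running_mean": [0.0]}
unlearned = {"conv.weight": [4.0, 6.0], "bn.running_mean": [9.0]}
distance = l2_weight_distance(original, unlearned)
```

A larger value means the unlearned model sits further from the original in weight space, which is the quantity the abstract links to resistance against relearning.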

Paper Structure

This paper contains 23 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Fine-tuning an unlearned model on just the retain set recovers performance on the forget set! Results on CIFAR-10 using a forget set of atypical examples from class 'airplane'.
  • Figure 2: Scatter plots with test-set accuracy on the x-axis and accuracy on the held-out portion of the forget set, $\mathcal{D}_{F_{ho}}$, on the y-axis. The left-most subplot shows performance immediately after unlearning. The next three subplots show performance after a relearning attack that uses instances of the retain set, $\mathcal{D}_R$, and a varying number of instances of the forget set, $\mathcal{D}_{F_{re}}$ (0, 10, and 100, respectively). Each point is the average performance over the last 50 steps (see \ref{fig:relearning_forget_cifar10_resnet18_high_mem_evolution} for the whole trajectory for sub-class unlearning and \ref{fig:relearning_forget_cifar10_resnet18_high_mem_all_cls_remaining} for the trajectory for class-agnostic unlearning). The forget set consists of atypical examples in CIFAR-10: from the 'airplane' class in the top row (sub-class unlearning) and from all classes in the bottom row (class-agnostic unlearning). The figure indicates that many methods achieve near-perfect recovery of unlearned knowledge with only a small amount of model fine-tuning, even with 0 relearning examples (fine-tuning on only the retain set). Weight Distortion, CBFT, and Weight Dist Reg are introduced in \ref{sec:weight_space_analysis}.
  • Figure 3: Comparison between test-set accuracy and accuracy on the held-out part of the forget set, $\mathcal{D}_{F_{ho}}$, after relearning, for sub-class unlearning of atypical examples in CIFAR-10. We consider two-phase unlearning methods: first, an initial safeguard (the unlearning phase) is applied, using the unlearning algorithm named in the subplot title. Then, each of TAR, CBFT, and Weight Dist Reg is applied as a second phase to increase tamper-resistance. The '+' symbol marks the performance of the initial safeguard for reference. We observe that TAR fails to add any tamper-resistance beyond that of the initial safeguard, despite being designed for this purpose.
  • Figure 4: Linear mode connectivity analysis on CIFAR-10, where the forget set consists of atypical examples. We construct a linear path between the pretrained and the unlearned (or retrained-from-scratch) model by interpolating the model parameters and batch-norm statistics using different mixing weights (shown on the x-axis). We report accuracy on the y-axis. 0 on the x-axis represents the pretrained model, while 1 represents the unlearned or retrained model. Retrain-from-scratch is not linearly connected to the pretrained model, whereas for unlearning algorithms, the resulting unlearned model is in many cases still linearly connected to the pretrained one.
  • Figure 5: $L_2$ norm of the difference between the parameters of the pretrained and the unlearned models induced by different unlearning methods, for ResNet-18 trained on CIFAR-10, where the forget set consists of atypical examples. Only the parameters are compared; the batch-norm statistics are ignored.
  • ...and 8 more figures
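
The linear path described in the Figure 4 caption can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation. The helper names (`interpolate`, `accuracy_along_path`, `accuracy_barrier`), the dictionary-of-lists model representation, and the `toy_accuracy` evaluator are hypothetical. The barrier shown, the largest drop below the straight line joining the endpoint accuracies, is one common convention and is assumed here rather than taken from the source.

```python
def interpolate(theta_a, theta_b, alpha):
    """Return (1 - alpha) * theta_a + alpha * theta_b for every entry.

    Each model maps a name to a flat list of floats. Every entry in the
    dictionary is mixed, so batch-norm statistics stored there are
    interpolated along with the parameters.
    """
    return {
        name: [(1 - alpha) * a + alpha * b
               for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def accuracy_along_path(theta_a, theta_b, evaluate, num_points=11):
    """Evaluate accuracy at evenly spaced mixing weights in [0, 1]."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, evaluate(interpolate(theta_a, theta_b, alpha)))
            for alpha in alphas]


def accuracy_barrier(path):
    """Largest drop below the line joining the two endpoint accuracies."""
    acc_start, acc_end = path[0][1], path[-1][1]
    return max((1 - alpha) * acc_start + alpha * acc_end - acc
               for alpha, acc in path)


# Toy usage: a made-up evaluator whose accuracy dips midway along the path.
def toy_accuracy(theta):
    return abs(theta["w"][0] - 1.0)

pretrained = {"w": [0.0]}   # mixing weight 0 on the x-axis
unlearned = {"w": [2.0]}    # mixing weight 1 on the x-axis
path = accuracy_along_path(pretrained, unlearned, toy_accuracy)
barrier = accuracy_barrier(path)
```

In this reading, a barrier near zero means the two models are linearly connected, while a large barrier means accuracy collapses somewhere between them, as the caption reports for the retrain-from-scratch model.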