From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
TL;DR
The paper examines how unlearning methods for vision classifiers can be defeated by relearning attacks, in which fine-tuning restores supposedly forgotten information. Working in a controlled, example-level setting, it shows that forget-set accuracy can rebound to nearly 100% even when the attacker fine-tunes on the retain set alone, which indicates that the unlearning was incomplete. Taking a weight-space view, the paper uses two diagnostics, the $L_2$ distance and Linear Mode Connectivity between the original and the unlearned model, to predict tamper-resistance and to guide the design of a new class of methods. The proposed approaches, notably Weight Distortion and Weight Dist Reg, push the unlearned model away from the pretrained one and establish stronger barriers between them in weight space, achieving state-of-the-art tamper-resistance at some cost to test accuracy. The work underscores the practical importance of tamper-resistance in unlearning and offers concrete strategies for more robust deployment.
Abstract
Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically, $L_2$-distance and linear mode connectivity between the original and the unlearned model. Leveraging this insight, we propose a new class of methods that achieve state-of-the-art resistance to relearning attacks.
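The two weight-space diagnostics named above can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes model weights are already flattened into vectors, substitutes a hypothetical two-dimensional double-well loss (`toy_loss`) for a real network's loss, and assumes one common definition of the linear-mode-connectivity barrier (the largest excess of the loss along the straight path over the linear interpolation of the endpoint losses). All function and variable names are illustrative.

```python
import numpy as np


def l2_distance(w_a, w_b):
    """Euclidean distance between two flattened weight vectors."""
    return float(np.linalg.norm(np.asarray(w_a, dtype=float) - np.asarray(w_b, dtype=float)))


def interpolation_losses(w_a, w_b, loss_fn, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment from w_a to w_b."""
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1.0 - a) * w_a + a * w_b) for a in alphas])
    return alphas, losses


def loss_barrier(losses):
    """Largest excess of the path loss over the straight line joining the endpoint losses.

    Assumed definition; a value near zero suggests the endpoints are linearly connected.
    """
    baseline = np.linspace(losses[0], losses[-1], len(losses))
    return float(np.max(losses - baseline))


def toy_loss(w):
    """Hypothetical double-well loss with minima at (-1, 0) and (1, 0)."""
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2


# Stand-ins for the original model and two candidate unlearned models.
w_orig = np.array([-1.0, 0.0])
w_near = np.array([-0.9, 0.0])  # stays in the same basin as w_orig
w_far = np.array([1.0, 0.0])    # sits in the other basin

for name, w in [("near", w_near), ("far", w_far)]:
    _, path_losses = interpolation_losses(w_orig, w, toy_loss)
    print(name, "L2 distance:", l2_distance(w_orig, w), "barrier:", loss_barrier(path_losses))
```

In this toy setup the "far" candidate lies at distance 2 from the original and is separated from it by a barrier of 1, while the "near" candidate lies at distance of about 0.1 with no barrier. By analogy with the abstract's claim, the first pattern (large distance, broken linear connectivity) is the one associated with greater resistance to relearning.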
