Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation
Ali Ebrahimpour-Boroojeny
TL;DR
The thesis advances reliable machine unlearning by introducing AMUN, which fine-tunes on adversarially generated near-forget samples to emulate retrained behavior; it also develops FastClip for precise spectral-norm control and LOTOS for diverse, robust ensembles. Building on this, it introduces TRW to robustly forget entire classes by tilting the retained-class distribution according to inter-class similarities, mitigating leakage identified by the MIA-NN attack. The work provides theoretical bounds linking Lipschitz properties to unlearning effectiveness and demonstrates strong empirical results across CIFAR-10/100, MNIST, and Tiny-ImageNet, often matching or surpassing retraining baselines and reducing leakage under strong MIA evaluations. Collectively, these contributions offer scalable, provably grounded, and practically effective approaches for both random-sample and class-level unlearning, with broad implications for privacy-preserving model deployment.
Abstract
We propose new methodologies for both unlearning a random set of samples and unlearning entire classes, and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of predictions to those of a retrained model on both the forget and remain samples. We introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification as measured by state-of-the-art membership inference attack (MIA) scores. AMUN lowers the model's confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN's performance, including smoothness. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves adversarial example transfer, thereby supporting the second factor identified in that analysis. Following the same principles for class unlearning, we show that existing methods fail to replicate a retrained model's behavior. To do so, we introduce a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples, and we use it to demonstrate the vulnerability of these methods. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on previously used metrics.
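The tilting step described above can be illustrated with a small sketch. This is not the thesis's implementation: it assumes an exponential tilt of the model's probabilities over the remaining classes, and the names `tilted_reweighting`, `similarity`, and `beta` are hypothetical, introduced only for illustration. The exact similarity estimate and tilt used by TRW may differ.

```python
import math


def tilted_reweighting(probs, forget_class, similarity, beta=1.0):
    """Build a target distribution over the remaining classes (illustrative only).

    probs:        model's softmax output for one forget-class input.
    forget_class: index of the class being unlearned.
    similarity:   hypothetical per-class similarity to the forget class.
    beta:         hypothetical tilt strength; beta=0 is plain renormalization.
    """
    weights = []
    for k, p in enumerate(probs):
        if k == forget_class:
            # The forget class receives zero mass in the target.
            weights.append(0.0)
        else:
            # Assumed exponential tilt: classes more similar to the
            # forget class receive proportionally more mass.
            weights.append(p * math.exp(beta * similarity[k]))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no probability mass on the remaining classes")
    return [w / total for w in weights]


# Toy example with three classes; class 0 is the forget class.
probs = [0.70, 0.20, 0.10]
similarity = [0.0, 0.9, 0.1]  # made-up similarity scores
target = tilted_reweighting(probs, forget_class=0, similarity=similarity, beta=2.0)
```

In this toy setup the target places no mass on the forget class and shifts mass toward class 1, the class assumed to be most similar to it. A fine-tuning loss could then pull the model's output on forget-class inputs toward `target`, for example with a cross-entropy or KL term; that choice is also an assumption here.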
