Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation

Ali Ebrahimpour-Boroojeny

TL;DR

The thesis advances reliable machine unlearning by introducing AMUN, which fine-tunes on adversarially generated near-forget samples to emulate retrained behavior; it also develops FastClip for precise spectral-norm control and LOTOS for diverse, robust ensembles. Building on this, it introduces TRW to robustly forget entire classes by tilting the retained-class distribution according to inter-class similarities, mitigating leakage identified by the MIA-NN attack. The work provides theoretical bounds linking Lipschitz properties to unlearning effectiveness and demonstrates strong empirical results across CIFAR-10/100, MNIST, and Tiny-ImageNet, often matching or surpassing retraining baselines and reducing leakage under strong MIA evaluations. Collectively, these contributions offer scalable, provably grounded, and practically effective approaches for both random-sample and class-level unlearning, with broad implications for privacy-preserving model deployment.

Abstract

We propose new methodologies for unlearning both a random set of samples and an entire class, and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of predictions to a retrained model on both the forget and remain samples. We introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification based on SOTA MIA scores. AMUN lowers the model's confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN's performance, including smoothness. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves adversarial example transfer, thereby supporting the second factor above. Following the same principles for class unlearning, we show that existing methods fail to replicate a retrained model's behavior: we introduce a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples, and use it to demonstrate the vulnerability of such methods. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior metrics.
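
The tilting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the TRW distribution, so everything here is an assumption for illustration: the function name `tilted_reweighting`, the exponential tilt `p_k * exp(beta * s_k)`, the `similarity` scores, and the strength parameter `beta` are hypothetical and not taken from the thesis.

```python
import math

def tilted_reweighting(probs, forget_class, similarity, beta=1.0):
    """Hypothetical sketch of a tilted target over the remaining classes.

    probs        -- model's softmax output for one forget-class input
    forget_class -- index of the class being unlearned
    similarity   -- assumed similarity score of each class to the forget class
    beta         -- assumed tilt strength; beta = 0 is plain renormalization
    """
    weights = []
    for k, p in enumerate(probs):
        if k == forget_class:
            weights.append(0.0)  # the forget class receives no mass
        else:
            # Assumed exponential tilt toward classes similar to the forget class.
            weights.append(p * math.exp(beta * similarity[k]))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no probability mass on the remaining classes")
    return [w / total for w in weights]

probs = [0.70, 0.20, 0.10]    # class 0 is the forget class
similarity = [0.0, 1.0, 0.0]  # class 1 is assumed closer to class 0
target = tilted_reweighting(probs, forget_class=0, similarity=similarity)
plain = tilted_reweighting(probs, forget_class=0, similarity=similarity, beta=0.0)
```

In this toy example the tilted target places more mass on class 1 than plain renormalization does, which is the qualitative behavior the abstract attributes to using inter-class similarity.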

Paper Structure

This paper contains 111 sections, 9 theorems, 54 equations, 21 figures, 21 tables, 4 algorithms.

Key Result

Proposition 5.2.1

Assume $\mathcal{X} = [0,1]^d$ and $\|\delta_x\| \leq \epsilon$. For two models $\mathcal{F}$ and $\mathcal{G}$, if the loss function on both for any $y \in \mathcal{Y}$ is $L$-Lipschitz with respect to the inputs, we have the following inequality:

Figures (21)

  • Figure 1: Accuracy vs. Robust Accuracy vs. Transferability: Changes in the average accuracy and robust accuracy of individual ResNet-18 models, along with the average transferability rate between any pair of the models in each ensemble, as the layer-wise clipping value (spectral norm) changes. As the plots show, although the robustness of individual models increases as the clipping value decreases, the transferability rate among the models also increases, which might forfeit the robustness benefits of clipping for the ensemble as a whole.
  • Figure 2: (a) The first three plots show the clipping of the convolutional layer in a simple two-layer network to various values on MNIST. As the clipping target gets smaller, the spectral norm of the batch norm layer compensates and becomes larger. Meanwhile, the spectral norm of their concatenation slightly decreases. (b) The right-most plot shows the spectral norm of a convolutional layer, its succeeding batch norm layer, and their concatenation from the clipped ResNet-18 model trained on CIFAR-10. Although the convolutional layer is clipped to $1$, the spectral norm of the concatenation is much larger due to the presence of the batch norm layer.
  • Figure 3: The layer-wise spectral norm of a ResNet-18 model trained on CIFAR-10 (a) and MNIST (b) using each of the clipping methods. The time columns show the training time per epoch for these methods. (c) The layer-wise spectral norm of a DLA model trained on CIFAR-10 using each of the clipping methods. The time column shows the training time per epoch for these methods. As all of the plots show, with our method every layer has a spectral norm very close to the target value $1$. Our method is also much faster than the relatively accurate alternatives and shows a slower increase in running time as the model gets larger.
  • Figure 4: Comparison of the clipping methods in a simple network with only one convolutional layer and one dense layer, where the target value is $1$. Our method is the only one that clips this layer correctly for all different settings: (1) kernel of size $3$ with reflect padding, (2) kernel of size $3$ with same padding, (3) kernel of size $3$ and zeros padding with stride of $2$, and (4) kernel of size $5$ with same replicate padding and stride of $2$.
  • Figure 5: (a) Each of these three subplots shows the spectral norms of a convolutional layer, its succeeding batch norm layer, and their concatenation in a ResNet-18 model trained on CIFAR-10. The convolutional layers in this model are clipped to $1$. Instead of clipping the batch normalization layer, our method has been applied to the concatenation to control its spectral norm. (b) The rightmost subplot shows the training accuracy for the ResNet-18 model that is trained on CIFAR-10. One curve belongs to the model with the convolutional layers clipped to $1$ using FastClip and the batch norm layers clipped using the direct method used by prior works (FastClip-clip BN). The other two belong to FastClip and FastClip-concat.
  • ...and 16 more figures
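
The captions above refer to bringing a layer's spectral norm down to a target value. As a rough illustration of that quantity, the sketch below estimates the spectral norm of a small dense matrix with power iteration and rescales the matrix when the norm exceeds the target. This is not FastClip: uniform rescaling shrinks every singular value, and the sketch does not handle convolutional layers, padding, stride, or batch norm. The helper names (`spectral_norm`, `rescale_to_spectral_norm`) are hypothetical.

```python
import math
import random

def matvec(W, v):
    # Compute W v for a matrix stored as a list of rows.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rmatvec(W, u):
    # Compute W^T u.
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def spectral_norm(W, iters=200, seed=0):
    # Power iteration on W^T W; a random start avoids being
    # orthogonal to the top singular vector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        v = rmatvec(W, matvec(W, v))
        n = norm(v)
        if n == 0.0:
            return 0.0  # W maps v to zero, e.g. the zero matrix
        v = [x / n for x in v]
    return norm(matvec(W, v))

def rescale_to_spectral_norm(W, target=1.0):
    # Simplification: scale the whole matrix instead of clipping
    # only the singular values above the target.
    s = spectral_norm(W)
    if s <= target:
        return [row[:] for row in W]
    scale = target / s
    return [[w * scale for w in row] for row in W]

W = [[3.0, 0.0], [0.0, 0.5]]  # singular values 3 and 0.5
W_scaled = rescale_to_spectral_norm(W, target=1.0)
```

A true clipping method would leave the smaller singular value at 0.5 and only reduce the largest one to the target; the rescaling here shrinks both, which is why it is only a stand-in for the idea.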

Theorems & Definitions (22)

  • Definition 5.1.1: Attack Algorithm
  • Definition 5.1.2: Transferability Rate
  • Proposition 5.2.1
  • Proposition 5.3.1
  • Theorem 5.4.1
  • Remark 5.4.2
  • Theorem 5.4.3
  • Definition 6.1.1: Attack Algorithm
  • Definition 6.1.2: Machine Unlearning
  • Theorem 6.4.1
  • ...and 12 more
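
The list above names a transferability rate (Definition 5.1.2) without stating it here. One common way to measure transferability, assumed for this sketch and not necessarily the thesis's exact definition, is the fraction of adversarial examples that fool a source model and also fool a target model. The function name and the toy predictions are hypothetical.

```python
def transferability_rate(labels, source_preds, target_preds):
    """Fraction of source-fooling adversarial examples that also fool the target.

    labels       -- true labels of the clean inputs
    source_preds -- source model's predictions on the adversarial examples
    target_preds -- target model's predictions on the same adversarial examples
    """
    # Indices where the attack succeeded against the source model.
    fooled_source = [i for i, (y, p) in enumerate(zip(labels, source_preds)) if p != y]
    if not fooled_source:
        return 0.0  # no successful attacks, so nothing can transfer
    transferred = sum(1 for i in fooled_source if target_preds[i] != labels[i])
    return transferred / len(fooled_source)

labels = [0, 1, 2, 3]
source_preds = [1, 1, 0, 2]  # source is fooled on examples 0, 2, 3
target_preds = [1, 1, 2, 0]  # target is fooled on examples 0 and 3
rate = transferability_rate(labels, source_preds, target_preds)
```

Averaging this rate over every ordered pair of models in an ensemble would give a pairwise summary of the kind the Figure 1 caption describes.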