
Inexact Unlearning Needs More Careful Evaluations to Avoid a False Sense of Privacy

Jamie Hayes, Ilia Shumailov, Eleni Triantafillou, Amr Khalifa, Nicolas Papernot

TL;DR

The paper argues that evaluating privacy in inexact machine unlearning using population membership inference attacks overestimates protection. By introducing U-LiRA, a per-example LiRA-based attack adapted to the unlearning setting, the authors show persistent privacy leakage across both vision and language models, with substantial variation across forget and retain examples. They benchmark multiple unlearning methods, revealing that many reduce leakage for some points but increase it for others, and that retain-set privacy can deteriorate after unlearning. The findings call for formal threat models, per-example adversaries, and calibrated stopping criteria to reliably assess unlearning privacy in practice, influencing how future unlearning techniques are designed and evaluated.

Abstract

The high cost of model training makes it increasingly desirable to develop techniques for unlearning. These techniques seek to remove the influence of a training example without having to retrain the model from scratch. Intuitively, once a model has unlearned, an adversary that interacts with the model should no longer be able to tell whether the unlearned example was included in the model's training set or not. In the privacy literature, this is known as membership inference. In this work, we discuss adaptations of Membership Inference Attacks (MIAs) to the setting of unlearning (leading to their "U-MIA" counterparts). We propose a categorization of existing U-MIAs into "population U-MIAs", where the same attacker is instantiated for all examples, and "per-example U-MIAs", where a dedicated attacker is instantiated for each example. We show that the latter category, wherein the attacker tailors its membership prediction to each example under attack, is significantly stronger. Indeed, our results show that the commonly used U-MIAs in the unlearning literature overestimate the privacy protection afforded by existing unlearning techniques on both vision and language models. Our investigation reveals a large variance in the vulnerability of different examples to per-example U-MIAs. In fact, several unlearning algorithms lead to a reduced vulnerability for some, but not all, examples that we wish to unlearn, at the expense of increasing it for other examples. Notably, we find that the privacy protection for the remaining training examples may worsen as a consequence of unlearning. We also discuss the fundamental difficulty of equally protecting all examples using existing unlearning schemes, due to the different rates at which examples are unlearned. We demonstrate that naive attempts at tailoring unlearning stopping criteria to different examples fail to alleviate these issues.
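To make the distinction between the two attacker categories concrete, here is a minimal sketch on synthetic scores. It is not the paper's attack or data. It assumes each example has its own baseline score level, a hypothetical fixed shift for examples that were in the forget set, and Gaussian noise. All function names and constants are illustrative.

```python
import numpy as np


def simulate(n_examples=2000, n_shadow=32, seed=0):
    """Draw synthetic attack scores (an assumption, not real model outputs)."""
    rng = np.random.default_rng(seed)
    base = rng.normal(0.0, 3.0, size=n_examples)  # per-example baseline level
    shift, noise = 1.0, 0.5                       # hypothetical constants
    # Shadow scores: example was in the forget set vs. never trained on.
    shadow_in = base[:, None] + shift + rng.normal(0.0, noise, (n_examples, n_shadow))
    shadow_out = base[:, None] + rng.normal(0.0, noise, (n_examples, n_shadow))
    # One target score per example, with a random ground-truth label.
    is_member = rng.integers(0, 2, size=n_examples).astype(bool)
    target = base + shift * is_member + rng.normal(0.0, noise, n_examples)
    return shadow_in, shadow_out, target, is_member


def population_attack(shadow_in, shadow_out, target):
    """Same decision rule for every example: one global threshold."""
    threshold = 0.5 * (shadow_in.mean() + shadow_out.mean())
    return target > threshold


def per_example_attack(shadow_in, shadow_out, target):
    """Dedicated decision rule: one threshold per example."""
    thresholds = 0.5 * (shadow_in.mean(axis=1) + shadow_out.mean(axis=1))
    return target > thresholds


def accuracy(pred, truth):
    return float(np.mean(pred == truth))


if __name__ == "__main__":
    s_in, s_out, target, truth = simulate()
    print("population :", accuracy(population_attack(s_in, s_out, target), truth))
    print("per-example:", accuracy(per_example_attack(s_in, s_out, target), truth))
```

In this toy setting the global threshold is dominated by the spread of baseline levels across examples, while the per-example thresholds cancel it out. That mirrors, in a simplified form, why an attacker tailored to each example can be stronger.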

Paper Structure (46 sections, 3 equations, 13 figures, 1 table, 1 algorithm)

This paper contains 46 sections, 3 equations, 13 figures, 1 table, 1 algorithm.

Figures (13)

  • Figure 1: Membership inference attack accuracy using a baseline attack and U-LiRA across different unlearning algorithms. Attack and unlearning algorithm descriptions are in \ref{sec:benchmark}. U-LiRA outperforms the baseline by a large margin across all unlearning algorithms because it creates per-example MIA decision rules.
  • Figure 2: For each example in class 5 of the CIFAR-10 training set, we compute its average predicted membership probability (the probability of being a training member, as output by U-LiRA) over 100 target models where this input was included in the forget set. We do this before and after unlearning, and compute the difference. We then compute the empirical CDF over all examples. If the CDF quickly approached 100% with only negative differences, this would imply that almost all examples have a reduction in privacy leakage after unlearning, as the posterior membership risk is smaller than the prior for most examples.
  • Figure 3: Empirical CDF of predicted membership probability before and after unlearning for all examples in class 5 of CIFAR-10 when they are included in the retain set.
  • Figure 4: Comparison of U-LiRA and the baseline U-MIA where we unlearn (with GradDesc) data memorized from fine-tuning PaLM 2.
  • Figure 5: (Left) The shadow model distributions for a specific example from CIFAR-10. We plot a histogram of the rescaled logit values of this example when evaluated over shadow models that contained this example in the forget set, and shadow models that did not contain this example in the training set. We fit two Gaussians over the empirical distributions as described by \ref{alg:adapted_lira}. (Right) We train 100 target models including this example in the forget set and 100 target models excluding this example from training. The average predicted membership probability over all occasions when it was a member of the forget set is $>0.9$, and when it was not a member the average probability is close to zero. This is unsurprising because the shadow member and non-member distributions are well separated, making membership prediction an easy task. However, there are numerous errors across target models, where the example was incorrectly predicted as in the forget set or out of training.
  • ...and 8 more figures
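
The per-example decision rule described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' algorithm. It assumes the rescaled logit is log(p / (1 - p)) of the model's confidence on the true label (as in the original LiRA attack), equal prior odds for the two hypotheses, and hypothetical function names. The shadow scores in the demo are synthetic.

```python
import numpy as np


def rescaled_logit(p, eps=1e-12):
    """Map a confidence p in (0, 1) to log(p / (1 - p)) (assumed rescaling)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def fit_gaussian(scores):
    """Return the mean and standard deviation of one shadow distribution."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std() + 1e-8  # small floor avoids division by zero


def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


def membership_probability(target_score, forget_scores, out_scores):
    """Posterior that the example was in the forget set, under equal priors.

    forget_scores: rescaled logits from shadow models that trained on the
                   example and then unlearned it.
    out_scores:    rescaled logits from shadow models that never saw it.
    """
    mu_f, sd_f = fit_gaussian(forget_scores)
    mu_o, sd_o = fit_gaussian(out_scores)
    log_f = log_normal_pdf(target_score, mu_f, sd_f)
    log_o = log_normal_pdf(target_score, mu_o, sd_o)
    # sigmoid(log_f - log_o), written with logaddexp for numerical stability
    return float(np.exp(-np.logaddexp(0.0, log_o - log_f)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    forget = rng.normal(4.0, 1.0, size=64)  # synthetic shadow scores
    out = rng.normal(-2.0, 1.0, size=64)    # synthetic shadow scores
    print(membership_probability(rescaled_logit(0.98), forget, out))
```

When the two fitted Gaussians are well separated, as in the caption's example, the posterior saturates near 0 or 1. Overlapping distributions yield probabilities closer to 0.5.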

Theorems & Definitions (1)

  • Definition 3.1