On the limitation of evaluating machine unlearning using only a single training seed

Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma

TL;DR

This work demonstrates that evaluating machine unlearning (MU) using a single training seed can yield misleading conclusions, particularly for deterministic MU methods. It develops a formal, seed-aware evaluation framework and shows that the total variability of an unlearning metric $Z$ comprises a training-seed contribution and an unlearning-seed contribution. With $S$ denoting the training seed, the former is captured by the term $\mathrm{Var}[\mathbb{E}(Z|S)]$ in the variance decomposition $\mathrm{Var}[Z] = \mathrm{Var}[\mathbb{E}(Z|S)] + \mathbb{E}[\mathrm{Var}(Z|S)]$. Empirically, across image classification (CIFAR-100/20 with ResNet-18) and federated online learning to rank (FOLTR, on MQ2007 LETOR) tasks, single-seed evaluations of MU methods, especially deterministic ones like SSD and LFSSD, can be highly non-representative: running multiple unlearning seeds cannot fully compensate for using only one training seed. The findings advocate for reporting results over multiple training seeds and show that seed-aware evaluation also matters beyond image tasks, including FOLTR, with practical implications for robust MU benchmarking and reproducibility.
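The variance decomposition above can be checked numerically on a grid of metric values. The following is a minimal sketch, not the authors' code: the function name `decompose_variance` and the layout `z[i][j]` (training seed `i`, unlearning seed `j`) are assumptions made for illustration, and the numbers are made up. It uses population variances so that the identity holds exactly on a balanced grid.

```python
from statistics import fmean, pvariance


def decompose_variance(z):
    """Split the variance of a metric over a balanced seed grid.

    z[i][j] is the metric for training seed i and unlearning seed j
    (hypothetical layout). Returns (between, within, total), where
    between estimates Var[E(Z|S)] and within estimates E[Var(Z|S)].
    """
    if not z or len({len(row) for row in z}) != 1:
        raise ValueError("need the same number of unlearning seeds per training seed")
    cond_means = [fmean(row) for row in z]        # E(Z | S = s_i) per training seed
    between = pvariance(cond_means)               # spread caused by the training seed
    within = fmean([pvariance(row) for row in z])  # average spread over unlearning seeds
    total = pvariance([v for row in z for v in row])
    return between, within, total


# Made-up accuracies for a deterministic method: every row is constant,
# because rerunning the unlearning step on the same model changes nothing.
deterministic = [[0.9, 0.9, 0.9], [0.5, 0.5, 0.5], [0.7, 0.7, 0.7]]
between, within, total = decompose_variance(deterministic)
print(f"between={between:.4f} within={within:.4f} total={total:.4f}")
```

In this toy grid the within-seed term vanishes, so all of the variability sits in the training-seed term, which a single-training-seed evaluation never observes.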

Abstract

Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We illustrate that this is particularly relevant for MU methods that are deterministic, i.e., which always produce the same result when started from the same trained model. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
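The difference between the two evaluation protocols can be illustrated with a toy simulation. This is a hedged sketch only: `train`, `unlearn_deterministic` and `evaluate` are hypothetical stand-ins (the "model" is a single number, not a neural network), and the seed counts simply mirror the eleven-seed setup mentioned in the figure captions.

```python
import random
from statistics import pstdev


def train(training_seed):
    """Toy stand-in for training: the 'model' is one seed-dependent number."""
    return random.Random(training_seed).gauss(0.7, 0.1)


def unlearn_deterministic(model, unlearning_seed):
    """Toy deterministic unlearning: the unlearning seed has no effect."""
    return model - 0.05


def evaluate(unlearn, training_seeds, unlearning_seeds):
    """Return one metric value per (training seed, unlearning seed) pair."""
    results = []
    for s in training_seeds:
        model = train(s)  # one trained model per training seed
        for u in unlearning_seeds:
            results.append(unlearn(model, u))
    return results


# Common practice: one trained model, eleven unlearning runs.
common = evaluate(unlearn_deterministic, [0], range(11))
# Recommended: eleven trained models, one unlearning run each.
recommended = evaluate(unlearn_deterministic, range(11), [0])

print("spread, single training seed:   ", pstdev(common))
print("spread, multiple training seeds:", pstdev(recommended))
```

Under these assumptions the common-practice protocol reports zero spread for the deterministic method, while the multi-training-seed protocol exposes the variability inherited from training.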

Paper Structure

This paper contains 8 sections, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Retain-set accuracy (Panel a) and forget-set accuracy (Panel b) when using one training seed (combined with eleven unlearning seeds) versus eleven training seeds (each paired with one unlearning seed). Black lines represent the 25%, 50% and 75% quantiles of the accuracy of Retrain (the gold standard) across eleven seeds; whiskers extend to the full data range for clarity. This figure shows that the performance of the deterministic unlearning methods, SSD and LFSSD, is highly sensitive to the choice of training seed. Thus, empirical evaluation of unlearning methods, especially deterministic ones, should be based on multiple training seeds.
  • Figure 2: 2-Wasserstein distance between the empirical distributions (per unlearning class) of the unlearning metrics, retain-set accuracy and forget-set accuracy, under the common-practice approach ($I = 1$ and $J = 11$) and under our recommendation ($I = 11$ and $J = 1$). This figure illustrates that the results from Figure 1 extend to other unlearning classes and datasets, i.e., that considering only a single training seed can lead to a highly non-representative unlearning-metric distribution, especially for SSD and LFSSD.
  • Figure 3: FOLTR on MQ2007, model poison unlearning scenario. Offline NDCG@10 scores (evaluated on the held-out test set after each training/unlearning epoch) are plotted over 2000 epochs. The first 1000 epochs represent the training phase; epochs 1000-2000 represent the unlearning phase. Thin transparent lines show individual runs, thick solid lines represent the mean trajectory, and shaded regions represent the standard deviation.
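The 2-Wasserstein distance reported in Figure 2 compares two empirical distributions of a scalar metric. How the paper computes it is not shown here; the sketch below assumes one-dimensional samples of equal size (as with eleven values per protocol), in which case the distance reduces to the root-mean-square difference between sorted samples. The function name `wasserstein2_1d` and the sample values are illustrative.

```python
from math import sqrt


def wasserstein2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1D empirical distributions.

    With equally weighted samples of the same size, the optimal coupling
    matches the k-th smallest value of xs to the k-th smallest value of ys.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("expected two non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sqrt(sum((a - b) ** 2 for a, b in pairs) / len(xs))


# Made-up accuracies: a single training seed gives a point mass for a
# deterministic method, while multiple training seeds give a spread.
single_seed = [0.62] * 5
multi_seed = [0.40, 0.55, 0.62, 0.70, 0.81]
print(wasserstein2_1d(single_seed, multi_seed))
```

A large value means the metric distribution seen under one protocol is far from the distribution seen under the other; a value of zero means the two samples coincide.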