On the limitation of evaluating machine unlearning using only a single training seed
Jamie Lanyon, Axel Finke, Petros Andreou, Georgina Cosma
TL;DR
This work demonstrates that evaluating machine unlearning (MU) using a single training seed can yield misleading conclusions, particularly for deterministic MU methods. It develops a formal, seed-aware evaluation framework and shows that total unlearning variability comprises training-seed and unlearning-seed contributions, with the former captured by the term $\mathrm{Var}[\mathbb{E}(Z|S)]$ in the variance decomposition $\mathrm{Var}[Z] = \mathrm{Var}[\mathbb{E}(Z|S)] + \mathbb{E}[\mathrm{Var}(Z|S)]$, where $S$ is the training seed. Empirically, across image classification (CIFAR-100/20 with ResNet-18) and federated online learning to rank (FOLTR; MQ2007 LETOR) tasks, single-seed evaluations of MU methods, especially deterministic ones such as SSD and LFSSD, can be highly non-representative, and running multiple unlearning seeds cannot fully compensate for using only a single training seed. The findings advocate reporting results over multiple training seeds and show that seed-aware evaluation is relevant beyond image tasks, including FOLTR, with practical implications for robust MU benchmarking and reproducibility.
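The variance decomposition above can be illustrated with a minimal Python sketch. It assumes a balanced design (the same number of unlearning runs for every training seed) and uses population variances, under which the identity holds exactly for the empirical distribution. The function name `decompose_variance`, the layout of `results`, and all metric values are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pvariance


def decompose_variance(results):
    """Split the variance of a metric Z into training-seed and unlearning-seed parts.

    `results` maps each training seed S to the list of metric values Z obtained
    from the unlearning runs started from that trained model (hypothetical layout).
    Returns (between, within, total), where
      between = Var[E(Z|S)]  (variability across training seeds)
      within  = E[Var(Z|S)]  (variability across unlearning seeds)
      total   = Var[Z]
    """
    groups = list(results.values())
    if not groups or len({len(g) for g in groups}) != 1 or not groups[0]:
        raise ValueError("expected a non-empty, balanced design")

    conditional_means = [fmean(g) for g in groups]
    between = pvariance(conditional_means)
    within = fmean([pvariance(g) for g in groups])
    total = pvariance([z for g in groups for z in g])
    return between, within, total


# Hypothetical deterministic MU method: repeating the unlearning step from the
# same trained model always gives the same value, so the within-seed term is zero
# and all variability comes from the training seed.
deterministic = {0: [0.70, 0.70], 1: [0.50, 0.50], 2: [0.60, 0.60]}

# Hypothetical stochastic MU method: both terms contribute.
stochastic = {0: [1.0, 3.0], 1: [5.0, 7.0]}

for name, results in [("deterministic", deterministic), ("stochastic", stochastic)]:
    between, within, total = decompose_variance(results)
    print(f"{name}: between={between:.4f} within={within:.4f} total={total:.4f}")
```

In the deterministic case, repeating the unlearning step from a single trained model would report zero variance, although the metric differs across training seeds. This is the situation in which a single-seed evaluation is least informative.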
Abstract
Machine unlearning (MU) aims to remove the influence of certain data points from a trained model without costly retraining. Most practical MU algorithms are only approximate and their performance can only be assessed empirically. Care must therefore be taken to make empirical comparisons as representative as possible. A common practice is to run the MU algorithm multiple times independently starting from the same trained model. In this work, we demonstrate that this practice can give highly non-representative results because -- even for the same architecture and same dataset -- some MU methods can be highly sensitive to the choice of random number seed used for model training. We illustrate that this is particularly relevant for MU methods that are deterministic, i.e., which always produce the same result when started from the same trained model. We therefore recommend that empirical comparisons of MU algorithms should also reflect the variability across different model training seeds.
