Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition

Eleni Triantafillou, Peter Kairouz, Fabian Pedregosa, Jamie Hayes, Meghdad Kurmanji, Kairan Zhao, Vincent Dumoulin, Julio Jacques Junior, Ioannis Mitliagkas, Jun Wan, Lisheng Sun Hosoya, Sergio Escalera, Gintare Karolina Dziugaite, Peter Triantafillou, Isabelle Guyon

TL;DR

This work formalizes machine unlearning via an $(\varepsilon,\delta)$-unlearning framework and introduces an attack-based, empirical evaluation methodology to quantify forgetting quality $\mathcal{F}$ alongside model utility and efficiency. Forgetting quality is estimated per forget-set example by assessing how well an attacker can distinguish between retrained and unlearned model outputs, and these per-example scores are aggregated to yield a holistic forgetting score that guides ranking. Through extensive analysis of NeurIPS competition submissions and comparisons to prior state-of-the-art, the authors demonstrate that several top entries achieve superior forgetting under the proposed metric while maintaining acceptable utility, and they provide insights into per-example hardness and dataset generalizability. The findings underscore progress in unlearning and emphasize the need for standardized, compute-conscious benchmarking and attention to generalizability across datasets and tasks.

Abstract

We present the findings of the first NeurIPS competition on unlearning, which sought to stimulate the development of novel algorithms and initiate discussions on formal and robust evaluation methodologies. The competition was highly successful: nearly 1,200 teams from across the world participated, and a wealth of novel, imaginative solutions with different characteristics were contributed. In this paper, we analyze top solutions and delve into discussions on benchmarking unlearning, which itself is a research problem. The evaluation methodology we developed for the competition measures forgetting quality according to a formal notion of unlearning, while incorporating model utility for a holistic evaluation. We analyze the effectiveness of different instantiations of this evaluation framework vis-a-vis the associated compute cost, and discuss implications for standardizing evaluation. We find that the ranking of leading methods remains stable under several variations of this framework, pointing to avenues for reducing the cost of evaluation. Overall, our findings indicate progress in unlearning, with top-performing competition entries surpassing existing algorithms under our evaluation framework. We analyze trade-offs made by different algorithms and strengths or weaknesses in terms of generalizability to new datasets, paving the way for advancing both benchmarking and algorithm development in this important area.

Paper Structure

This paper contains 46 sections, 1 theorem, 5 equations, 22 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

Fix $\mathcal{D}$, $\mathcal{S} \subseteq \mathcal{D}$, and a randomized learning algorithm $\mathcal{A}$. Assume $\mathcal{U}$ is an $(\varepsilon,\delta)$-unlearning algorithm with respect to $(\mathcal{D}, \mathcal{S}, \mathcal{A})$. Let $X$ be sampled from either $\mathcal{A}(\mathcal{D} \setminus \mathcal{S})$ ...

Figures (22)

  • Figure 1: Left: overview of the evaluation of forgetting quality. We draw $N$ samples of $\theta^o$ and $\theta^r$ by repeating procedures $\mathcal{A}(\mathcal{D})$ and $\mathcal{A}(\mathcal{D} \setminus \mathcal{S})$, respectively, $N$ times with different random seeds. Then, we obtain $N$ samples of $\theta^u$ by applying $\mathcal{U}$ to each of the original model samples $\theta^o$. We then compute an estimate of forgetting quality $\mathcal{F}$ based on how similar the distributions of $\theta^u$ and $\theta^r$ are, according to a 1-dimensional test statistic (Section \ref{sec:eps_computation}). Closeness of those distributions indicates good unlearning, associated with a higher $\mathcal{F}$-score. Right: example decision rule to separate the histograms of a 1-dimensional test statistic of the distributions of $\theta^u$ and $\theta^r$, for a given example in the forget set (see Section \ref{sec:eps_computation}). This decision rule predicts "unlearned" for values greater than the threshold shown as a black dotted line. As we describe in Section \ref{sec:eps_computation}, we sweep several thresholds and use the one that best separates the two distributions to measure their closeness.
  • Figure 2: Practical instantiations of our evaluation framework that explore the accuracy / efficiency trade-off. In each case, $N$ samples from each of $\theta^u$ and $\theta^r$ are used to compute an estimate of $\mathcal{F}$, and we obtain $E$ samples of $\mathcal{F}$ to compute confidence intervals. The setups differ in how much "work" is reused across the $E$ "experiments". The plate notation (a rectangle with a number, e.g. $N$ or $E$, at its bottom right corner) denotes that the contents of the rectangle are repeated that number of times. Left, Setup "Full": In each of $E$ experiments, we draw $N$ samples of every distribution. This is the statistically correct variant, as it yields $E$ i.i.d. samples of $\mathcal{F}$, but it is very costly. Middle, Setup "Reuse-$N$-$N$": $N$ samples of each of $\theta^o$ and $\theta^r$ are drawn once and reused across the $E$ runs, each of which simply runs $\mathcal{U}$ on top of each sample of $\theta^o$. Right, Setup "Reuse-$N$-1": a single sample of $\theta^o$ is used to obtain all samples of $\theta^u$, and a single set of $N$ samples of $\theta^r$ is reused for all $E$ runs.
  • Figure 3: Commonalities between participants' methods. We illustrate the top three approaches here and provide diagrams for all analyzed competition methods in Figure \ref{fig:methods_overview_full}.
  • Figure 4: $\mathcal{F}$-scores obtained by different setups that trade off accuracy against efficiency (see Section \ref{sec:practical_instantiations}). $N = 1024$, $E = 20$.
  • Figure 5: Comparing leading competition algorithms (to the right of the dotted line) against state-of-the-art from the literature (to the left of the dotted line). We notice that several algorithms from the competition outperform existing ones according to our metrics. Setup "Full", $N = 1024$, $E = 10$.
  • ...and 17 more figures
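
The threshold sweep described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the competition's implementation: the function names (`error_rates`, `epsilon_from_rates`, `estimate_epsilon`) are hypothetical, the inputs are assumed to be plain lists holding the 1-dimensional test statistic for one forget-set example under the unlearned and retrained models, and the conversion from error rates to $\varepsilon$ assumes the standard hypothesis-testing reading of an $(\varepsilon,\delta)$ guarantee (FPR $+ e^{\varepsilon}$ FNR $\ge 1-\delta$, and symmetrically). The handling of zero error rates is likewise an assumption. How a per-example $\varepsilon$ is mapped to the $\mathcal{F}$-score is not covered here.

```python
import math
import random


def error_rates(unlearned, retrained, threshold):
    """Error rates of the rule: predict "unlearned" when the statistic > threshold."""
    fpr = sum(x > threshold for x in retrained) / len(retrained)   # retrained flagged as unlearned
    fnr = sum(x <= threshold for x in unlearned) / len(unlearned)  # unlearned flagged as retrained
    return fpr, fnr


def epsilon_from_rates(fpr, fnr, delta):
    """Largest epsilon implied by (fpr, fnr), or None if the pair is uninformative.

    Assumed conventions (not taken from the source):
    - both rates zero means perfect separation, reported as infinity;
    - exactly one rate zero is treated as a finite-sample artifact and skipped.
    """
    if fpr == 0.0 and fnr == 0.0:
        return math.inf
    if fpr == 0.0 or fnr == 0.0:
        return None
    candidates = []
    if 1.0 - delta - fpr > 0.0:
        candidates.append(math.log(1.0 - delta - fpr) - math.log(fnr))
    if 1.0 - delta - fnr > 0.0:
        candidates.append(math.log(1.0 - delta - fnr) - math.log(fpr))
    return max(candidates) if candidates else None


def estimate_epsilon(unlearned, retrained, delta=0.0):
    """Sweep thresholds and keep the epsilon of the best-separating rule."""
    values = sorted(set(unlearned) | set(retrained))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = 0.0  # epsilon = 0 means the two sample sets are indistinguishable
    for t in thresholds:
        fpr, fnr = error_rates(unlearned, retrained, t)
        # Also score the mirrored rule (predict "unlearned" when statistic <= t).
        for rates in ((fpr, fnr), (1.0 - fpr, 1.0 - fnr)):
            eps = epsilon_from_rates(rates[0], rates[1], delta)
            if eps is not None:
                best = max(best, eps)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    retrained = [rng.gauss(0.0, 1.0) for _ in range(512)]
    close = [rng.gauss(0.1, 1.0) for _ in range(512)]  # statistic nearly matches retraining
    far = [rng.gauss(2.0, 1.0) for _ in range(512)]    # statistic clearly shifted
    print("close:", estimate_epsilon(close, retrained))
    print("far:  ", estimate_epsilon(far, retrained))
```

Under this sketch, a smaller estimated $\varepsilon$ corresponds to distributions that are harder to tell apart, which the caption associates with better unlearning and a higher $\mathcal{F}$-score.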

Theorems & Definitions (2)

  • Definition 2.1
  • Theorem 2.2 (adapted from kairouz2015composition)