
Machine Unlearning: A Comprehensive Survey

Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Shui Yu

TL;DR

This survey provides a comprehensive, taxonomy-driven overview of machine unlearning, detailing centralized (exact and approximate) and distributed (federated) approaches, as well as unlearning verification and privacy/security concerns. It covers core techniques (split unlearning, certified data removal, Bayesian unlearning, graph unlearning), evaluation metrics, datasets, and practical challenges such as stochasticity, incrementality, and catastrophic unlearning. By synthesizing 136 relevant works (2020 onward) and outlining open questions, it offers a structured map of the field and concrete directions for future research, including robust verification without utility loss and secure, scalable unlearning in federated and graph contexts. The work highlights the practical impact of machine unlearning in enforcing privacy rights while balancing model utility and security in real-world ML deployments.

Abstract

As the right to be forgotten has been legislated worldwide, many studies have sought to design unlearning mechanisms that protect users' privacy when they leave machine learning service platforms. Specifically, machine unlearning makes a trained model remove the contribution of an erased subset of the training dataset. This survey aims to systematically classify a wide range of machine unlearning methods and discuss their differences, connections, and open problems. We categorize current unlearning methods into four scenarios: centralized unlearning, distributed and irregular data unlearning, unlearning verification, and privacy and security issues in unlearning. Since centralized unlearning is the primary domain, we introduce it in two parts: first, we classify centralized unlearning into exact unlearning and approximate unlearning; second, we offer a detailed introduction to the techniques of these methods. Beyond centralized unlearning, we note several studies on distributed and irregular data unlearning and introduce federated unlearning and graph unlearning as the two representative directions. After introducing unlearning methods, we review studies on unlearning verification. Moreover, we consider privacy and security issues essential in machine unlearning and organize the latest related literature. Finally, we discuss the challenges of various unlearning scenarios and outline potential research directions.

Paper Structure

This paper contains 27 sections, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Our taxonomy for machine unlearning. The survey introduces the topics in the order shown in this figure. We classify the current unlearning literature into four main scenarios: centralized unlearning, federated unlearning, unlearning verification, and privacy and security issues in machine unlearning.
  • Figure 2: Machine Unlearning Process
  • Figure 3: Privacy Leakage: a Privacy Reconstruction Process
  • Figure 4: How the model changes when a point is added or removed. (a) A normally trained classifier separates classes 1 and 2. (b) When a new point appears, the model is trained on it, and the decision boundary is pushed to classify it. (c) When an influential point must be removed, its contribution to the model should be undone. (d) When a non-influential point is removed, the model may not need to change much.
  • Figure 5: (a) Naive unlearning. There are only two steps: delete the specified samples from the whole dataset and retrain a model based on the remaining dataset. (b) Split unlearning. It contains four steps: 1. split the original dataset into $n$ shards, 2. remove the erased data from the corresponding shard, 3. retrain the sub-model of this shard, 4. ensemble all sub-models as the final model.
  • ...and 3 more figures
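
The four split-unlearning steps in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the survey's implementation. The names `SplitUnlearner`, `train_centroids`, and `predict_one` are hypothetical. The sub-model is a toy nearest-centroid classifier on 1-D features, the sharding is round-robin, and the ensemble is a majority vote. None of these choices come from the source.

```python
from collections import Counter


def train_centroids(shard):
    """Toy sub-model (assumption): per-class mean of 1-D features."""
    sums, counts = {}, Counter()
    for x, y in shard:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_one(model, x):
    """Return the class whose centroid is nearest to x, or None if the model is empty."""
    if not model:
        return None
    return min(model, key=lambda y: abs(model[y] - x))


class SplitUnlearner:
    def __init__(self, data, n_shards):
        # Step 1: split the original dataset into n shards (round-robin, an assumption).
        self.shards = [data[i::n_shards] for i in range(n_shards)]
        self.models = [train_centroids(shard) for shard in self.shards]
        self.retrain_count = 0

    def unlearn(self, sample):
        """Remove one sample and retrain only the sub-model of its shard."""
        for i, shard in enumerate(self.shards):
            if sample in shard:
                shard.remove(sample)                     # Step 2: remove the erased data
                self.models[i] = train_centroids(shard)  # Step 3: retrain this shard's sub-model
                self.retrain_count += 1
                return i
        raise ValueError("sample not found in any shard")

    def predict(self, x):
        # Step 4: ensemble all sub-models (majority vote, an assumption).
        votes = Counter()
        for model in self.models:
            label = predict_one(model, x)
            if label is not None:
                votes[label] += 1
        return votes.most_common(1)[0][0] if votes else None


data = [(0.0, "a"), (1.0, "a"), (0.5, "a"), (10.0, "b"), (11.0, "b"), (10.5, "b")]
unlearner = SplitUnlearner(data, n_shards=3)
touched = unlearner.unlearn((11.0, "b"))
print(f"retrained shard {touched}; prediction for 9.0: {unlearner.predict(9.0)}")
```

Only the shard that held the erased sample is retrained, which is where the saving over naive unlearning (retraining on the whole remaining dataset) comes from. The other sub-models never saw the sample, so they are left untouched.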