Delete My Account: Impact of Data Deletion on Machine Learning Classifiers

Tobias Dam, Maximilian Henzl, Lukas Daniel Klausner

TL;DR

The paper investigates how the GDPR right to erasure affects classifier performance by simulating data deletions under several plausible user-behavior scenarios. Using four supervised classifiers (SVM, k-NN, Random Forest, XGBoost) and four real-world datasets, it compares random deletions with bias-driven and incremental deletion schemes. Its key finding is that the impact of deletion depends on the deletion amount, dataset size, and the specific attribute values that influence the target variable; biased deletions can cause notable performance degradation, especially in larger datasets and at high deletion levels. The study provides a reproducible framework for evaluating data deletion effects and highlights the need for empirical data on deletion behavior to guide privacy-preserving ML practices.

Abstract

Users are more aware than ever of the importance of their own data, thanks to reports about security breaches and leaks of private, often sensitive data in recent years. Additionally, the GDPR has been in effect in the European Union for over three years and many people have encountered its effects in one way or another. Consequently, more and more users are actively protecting their personal data. One way to do this is to make use of the right to erasure guaranteed in the GDPR, which has potential implications for a number of different fields, such as big data and machine learning. Our paper presents an in-depth analysis of how use of the right to erasure affects the performance of machine learning models on classification tasks. We conduct various experiments utilising different datasets as well as different machine learning algorithms to analyse a variety of deletion behaviour scenarios. Due to the lack of credible data on actual user behaviour, we make reasonable assumptions for various deletion modes and biases and provide insight into the effects of different plausible scenarios for right to erasure usage on data quality for machine learning. Our results show that the impact depends strongly on the amount of data deleted, the particular characteristics of the dataset, the bias chosen for deletion and the assumptions made about user behaviour.
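The difference between unbiased and biased deletion described above can be illustrated with a small sketch. This is not the authors' framework: the record layout, the `age_group` attribute, the weights and the function names are all hypothetical, and the weighted sampling uses the Efraimidis-Spirakis key trick as a stand-in for whatever bias mechanism the paper actually applies. Incremental deletion is not sketched because this excerpt does not define it.

```python
import random


def delete_random(records, fraction, rng):
    """Drop `fraction` of the records uniformly at random (unbiased deletion)."""
    n_delete = int(round(len(records) * fraction))
    doomed = set(rng.sample(range(len(records)), n_delete))
    return [r for i, r in enumerate(records) if i not in doomed]


def delete_biased(records, fraction, weight, rng):
    """Drop `fraction` of the records, favouring records with a high weight.

    `weight(record)` must return a positive number. Each record gets the key
    u ** (1 / w) with u uniform in [0, 1); the records with the largest keys
    are deleted (weighted sampling without replacement).
    """
    n_delete = int(round(len(records) * fraction))
    keys = [(rng.random() ** (1.0 / weight(r)), i) for i, r in enumerate(records)]
    doomed = {i for _, i in sorted(keys, reverse=True)[:n_delete]}
    return [r for i, r in enumerate(records) if i not in doomed]


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 50 "young" and 50 "old" users.
rng = random.Random(0)
records = [{"age_group": "young"} for _ in range(50)]
records += [{"age_group": "old"} for _ in range(50)]

# Assumption for illustration only: young users are 50x more likely to
# request erasure than old users.
kept_random = delete_random(records, 0.3, rng)
kept_biased = delete_biased(
    records, 0.3, lambda r: 50.0 if r["age_group"] == "young" else 1.0, rng
)


def share_young(rows):
    return sum(r["age_group"] == "young" for r in rows) / len(rows)


print("young share after random deletion:", share_young(kept_random))
print("young share after biased deletion:", share_young(kept_biased))
```

Both schemes remove the same number of records, but the biased scheme shifts the attribute distribution of what remains. Retraining a classifier on each retained set and comparing `f1_score` on a fixed test set would then show how much of any performance change comes from the bias rather than from the smaller sample size.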

Paper Structure

This paper contains 24 sections, 2 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Overview of F1 scores for random deletion for all datasets and classifiers.
  • Figure 2: Overview of F1 score difference between random and incremental deletion for all datasets and classifiers.
  • Figure 3: Comparison of F1 scores for differently biased deletion for all classifiers on the adult dataset.
  • Figure 4: Comparison of F1 scores for differently biased deletion for all classifiers on the cahousing dataset.
  • Figure 5: Comparison of F1 scores for deletion method variations of the same attribute for all classifiers on the adult dataset.
  • ...and 9 more figures