Making AI Forget You: Data Deletion in Machine Learning

Antonio Ginart, Melody Y. Guan, Gregory Valiant, James Zou

TL;DR

This work formalizes deletion efficiency for machine learning: deleting a point $i$ from a model trained on a dataset $D$ must yield a model indistinguishable from one trained on $D_{-i}$, the dataset with point $i$ removed. It introduces two deletion-efficient, provably sound clustering algorithms for $k$-means, Quantized $k$-Means (Q-$k$-means) and Divide-and-Conquer $k$-Means (DC-$k$-means), which achieve large amortized speedups (over $100\times$ on average across six datasets) while preserving statistical quality relative to a $k$-means++ baseline. The authors establish lower bounds for online deletion, derive amortized runtime guarantees under practical parameterizations, and report strong empirical performance across diverse datasets. They also propose four general engineering principles (Linearity, Laziness, Modularity, and Quantization) for designing deletion-efficient ML systems and discuss extensions to other models such as kernel methods, decision trees, and deep networks.
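
The indistinguishability requirement can be written compactly. The notation below is assumed for illustration and is not taken from this summary: $\mathcal{A}$ denotes a (possibly randomized) learning algorithm, $R_{\mathcal{A}}$ a deletion operation for it, and $\overset{d}{=}$ equality in distribution over the algorithm's internal randomness.

```latex
% Assumed notation: \mathcal{A} = learning algorithm, R_{\mathcal{A}} = deletion operation.
% Deleting point i from the trained model must match retraining without point i.
R_{\mathcal{A}}\bigl(D,\ \mathcal{A}(D),\ i\bigr) \;\overset{d}{=}\; \mathcal{A}\bigl(D_{-i}\bigr)
\qquad \text{for every dataset } D \text{ and every index } i .
```

Retraining from scratch on $D_{-i}$ satisfies this condition trivially; deletion efficiency concerns how much cheaper than retraining an operation satisfying it can be.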

Abstract

Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used --- the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100X improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.

Paper Structure

This paper contains 64 sections, 15 theorems, 9 equations, 8 figures, 4 tables, and 7 algorithms.

Key Result

Theorem 4.1

Let $D$ be a dataset on $[0,1]^d$ of size $n$. Fix parameters $T$, $k$, $\epsilon$, and $\gamma$ for Q-$k$-means. Then Q-$k$-means supports $m$ deletions in time $O(m^2d^{5/2}/\epsilon)$ in expectation, where the expectation is taken over the randomness in the quantization phase and the $k$-means++ initialization.
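
To give some intuition for why quantization helps with deletion, here is a minimal Python sketch for a single centroid. It is not the authors' implementation of Q-$k$-means. The class `QuantizedMean`, the helper `quantize`, and the uniformly random lattice phase are hypothetical names and choices made for illustration. The sketch also leaves out cluster reassignment, the $T$ iterations, and the role of $\gamma$. It only shows that a mean stored as running sums can absorb a deletion in $O(d)$ time, and that snapping it to an $\epsilon$-spaced lattice leaves the reported centroid unchanged for most deletions.

```python
import random


def quantize(x, eps, phase):
    """Snap each coordinate of x to an eps-spaced lattice shifted by `phase`."""
    return [eps * round((xi - pi) / eps) + pi for xi, pi in zip(x, phase)]


class QuantizedMean:
    """Mean of a point set, kept as running sums and reported on a lattice.

    Illustrative only: a single-centroid stand-in, not the full algorithm.
    """

    def __init__(self, points, eps, seed=0):
        d = len(points[0])
        rng = random.Random(seed)
        self.eps = eps
        # Random lattice phase, one offset per coordinate (an assumed choice).
        self.phase = [rng.uniform(-eps / 2, eps / 2) for _ in range(d)]
        # Sufficient statistics: coordinate-wise sum and point count.
        self.total = [sum(p[j] for p in points) for j in range(d)]
        self.count = len(points)
        self.centroid = self._quantized()

    def _quantized(self):
        mean = [t / self.count for t in self.total]
        return quantize(mean, self.eps, self.phase)

    def delete(self, point):
        """Remove one point in O(d) time; return True if the centroid moved."""
        if self.count <= 1:
            raise ValueError("cannot delete the last remaining point")
        self.total = [t - x for t, x in zip(self.total, point)]
        self.count -= 1
        new_centroid = self._quantized()
        moved = new_centroid != self.centroid
        self.centroid = new_centroid
        return moved
```

A call such as `QuantizedMean(points, eps=0.05).delete(points[0])` returns `False` whenever the updated mean still rounds to the same lattice vertex; in the full algorithm, a change in the quantized state is the kind of event that would force more expensive recomputation. Separately, dividing the bound in Theorem 4.1 by $m$ gives an expected amortized cost of $O(m\,d^{5/2}/\epsilon)$ per deletion.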

Figures (8)

  • Figure 1: Online deletion efficiency: # of deletions vs. amortized runtime (secs) for 3 algorithms on 6 datasets.
  • Figure 2: Average retrain occurrences during deletion stream for Q-$k$-means
  • Figure 3: Loss Ratio vs. $\epsilon$ for Q-$k$-means on 6 datasets
  • Figure 4: Loss Ratio vs. $w$ for DC-$k$-means on 6 datasets
  • Figure 5: Amortized runtime (seconds) for Q-$k$-means as a function of quantization granularity on Covtype
  • ...and 3 more figures

Theorems & Definitions (34)

  • Definition 3.1
  • Remark 3.1
  • Theorem 4.1
  • Corollary 4.1.1
  • Proposition 4.2
  • Corollary 4.2.1
  • Corollary 4.2.2
  • Definition A.1
  • Definition A.2
  • Definition A.3
  • ...and 24 more