Making AI Forget You: Data Deletion in Machine Learning
Antonio Ginart, Melody Y. Guan, Gregory Valiant, James Zou
TL;DR
This work formalizes deletion efficiency for machine learning. Deleting a point $i$ from a model trained on a dataset $D$ should yield a model indistinguishable from one trained on $D_{-i}$, the dataset with point $i$ removed. The paper introduces two provably deletion-efficient algorithms for $k$-means clustering, Quantized $k$-Means (Q-$k$-means) and Divide-and-Conquer $k$-Means (DC-$k$-means), which achieve an average amortized speedup of over $100\times$ across six datasets while producing clusters of statistical quality comparable to a $k$-means++ baseline. The authors establish lower bounds for online deletion, derive amortized runtime guarantees under practical parameterizations, and evaluate both algorithms empirically on the six datasets. They also propose four general engineering principles (Linearity, Laziness, Modularity, and Quantization) for designing deletion-efficient ML systems and discuss extensions to other models such as kernel methods, decision trees, and deep networks.
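The indistinguishability requirement can be written compactly. The following is a sketch in notation assumed here rather than quoted from this summary: $A$ denotes a (possibly randomized) learning algorithm, $A(D)$ the model it returns on dataset $D$, and $R_A$ a deletion operation that takes the dataset, the trained model, and the index of the point to remove.

$$R_A\bigl(D,\, A(D),\, i\bigr) \;=_d\; A(D_{-i}) \quad \text{for every dataset } D \text{ and every index } i,$$

where $=_d$ denotes equality in distribution over the internal randomness of $A$ and $R_A$. Under this reading, retraining from scratch on $D_{-i}$ trivially satisfies the condition; an algorithm is deletion-efficient when $R_A$ satisfies it at a cost well below that of retraining.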
Abstract
Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used --- the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100X improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.
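To illustrate why a divide-and-conquer structure can make deletion cheap, here is a minimal Python sketch, not the authors' DC-$k$-means implementation. It assumes a single level of sharding, an unweighted merge of shard centroids, and plain Lloyd's iterations with fixed seeds; the names `lloyd_kmeans` and `ShardedKMeans` are hypothetical. Deleting a point retrains only the shard that held it and then redoes the small merge step.

```python
import numpy as np

def lloyd_kmeans(X, k, seed, iters=20):
    """Plain Lloyd's iterations with a seeded random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every center: shape (n, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

class ShardedKMeans:
    """Hypothetical one-level divide-and-conquer k-means (illustration only)."""

    def __init__(self, X, k, n_shards, seed=0):
        self.k, self.n_shards, self.seed = k, n_shards, seed
        self.X = np.array(X, dtype=float)           # private copy of the data
        self.alive = np.ones(len(self.X), dtype=bool)
        rng = np.random.default_rng(seed)
        # Each point is assigned to a shard uniformly at random.
        self.shard_of = rng.integers(n_shards, size=len(self.X))
        self.shard_centers = [self._fit_shard(s) for s in range(n_shards)]
        self.centers = self._merge()

    def _fit_shard(self, s):
        pts = self.X[self.alive & (self.shard_of == s)]
        if len(pts) == 0:
            return np.empty((0, self.X.shape[1]))
        return lloyd_kmeans(pts, min(self.k, len(pts)), seed=self.seed + 1 + s)

    def _merge(self):
        # Cluster the shard centroids themselves (unweighted: an assumption).
        stacked = np.vstack(self.shard_centers)
        return lloyd_kmeans(stacked, min(self.k, len(stacked)), seed=self.seed)

    def delete(self, i):
        """Remove point i by retraining only its shard, then re-merging."""
        s = int(self.shard_of[i])
        self.alive[i] = False
        self.X[i] = 0.0                             # drop the raw value as well
        self.shard_centers[s] = self._fit_shard(s)
        self.centers = self._merge()

# Usage: delete one point from a 200-point dataset split into 4 shards.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
model = ShardedKMeans(data, k=3, n_shards=4, seed=0)
model.delete(5)
```

With `n_shards` shards of roughly equal size, a deletion in this sketch touches about `1 / n_shards` of the data plus the `n_shards * k` centroids in the merge step, which is the intuition behind the amortized speedups reported above. The other shards' centroids are left exactly as they were.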
