Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research

A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark A. Lemley, Nicolas Papernot, Katherine Lee

TL;DR

The paper argues that 'machine unlearning'—comprising both removal of training data effects and suppression of outputs—is not a universal solution for governing generative AI due to deep technical and legal mismatches. It distinguishes removal from suppression, analyzes their respective guarantees and limitations, and shows how these translate into copyright, privacy, and safety policy challenges. By formalizing five core mismatches and offering domain-specific takeaways, the work guides ML researchers and policymakers toward realistic, best-effort interventions and system-level controls rather than reliance on perfect unlearning. The practical impact is a framework for evaluating unlearning approaches within concrete regulatory contexts and a call to integrate governance mechanisms beyond purely technical fixes.

Abstract

"Machine unlearning" is a popular proposed solution for mitigating the existence of content in an AI model that is problematic for legal or moral reasons, including privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of specific information from a generative-AI model's parameters, e.g., a particular individual's personal data or the inclusion of copyrighted content in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for ML researchers and policymakers to think rigorously about these challenges, identifying several mismatches between the goals of unlearning and feasible implementations. These mismatches explain why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact.

Paper Structure

This paper contains 15 sections, 3 figures.

Figures (3)

  • Figure 1: One can think of CommonCanvas (Gokaslan et al., 2024) as a "gold-standard" model that does not contain in-copyright images of Mickey Mouse: the only training data that contain Mickey Mouse expression are from personal photographs, e.g., (a). Even without unlicensed, in-copyright training images of Mickey Mouse, the model can generate outputs that resemble "Mickey Mouse," e.g., (b).
  • Figure 2: We scrape all papers that match "unlearn*" or "model forgetting" from arXiv and plot their counts over time, as of December 4, 2024. As of this date, there were a total of $810$ papers starting from 1997 that matched our query. We indicate some important dates in the release of contemporary language and image generation models: GPT-2, T5, DALL-E, PaLM, Stable Diffusion (SD), ChatGPT, and Claude.
  • Figure 3: Both the (a) back-end and (b) front-end involve processes that have their own inputs and produce their own outputs (simplified here). This is why we use this additional terminology for clarifying which inputs and outputs are under discussion. There is nothing complicated here; it is just shorthand to signal different aspects of the trained model at different points in time.

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3