UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

Ilia Shumailov; Jamie Hayes; Eleni Triantafillou; Guillermo Ortiz-Jimenez; Nicolas Papernot; Matthew Jagielski; Itay Yona; Heidi Howard; Eugene Bagdasaryan

TL;DR

The paper tackles the challenge of removing impermissible knowledge from large language models (LLMs) and identifies a fundamental inconsistency arising from in-context learning (ICL). It introduces ununlearning, in which knowledge erased by unlearning is reintroduced through the prompt, undermining unlearning as a mechanism for content regulation. In a formal setting with a model $M: \mathcal{X} \to \mathcal{Y}$, its unlearned counterpart $\hat{M}$, and an unlearned subset $\hat{\mathcal{X}}$, it shows that for an input $\hat{X}$ from the unlearned subset, $M(\hat{X}) \approx \hat{M}(\text{prompt}+\hat{X})$, i.e., a suitable prompt can resurrect forgotten capabilities. The work argues that even exact unlearning is not enough to control impermissible knowledge and that content filtering is still required at inference time. It also discusses knowledge types, attribution, and forbidding strategies as complementary approaches. Overall, the paper calls for reconsidering reliance on unlearning as the sole regulator and emphasizes content-based safeguards and policy design for practical deployment of LLMs.

Abstract

Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning has often been discussed as an approach for removing impermissible knowledge, i.e., knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce the concept of ununlearning, where unlearned knowledge gets reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required and that even exact unlearning schemes are not enough for effective content regulation. We discuss the feasibility of ununlearning for modern LLMs and examine broader implications.

Paper Structure (6 sections, 2 figures)

Introduction
Nomenclature
Types of Knowledge
UnUnlearning
Discussion
Conclusion

Figures (2)

Figure 1: We broadly separate knowledge into two main types, axioms and theorems, which represent given facts and derived knowledge respectively. In the example above we assume that all theorems are defined in terms of underlying axioms, where some axioms are shared by different theorems. While unlearning Cat may make the model forget what Cat means, it is relatively easy to redefine it provided that the underlying axioms are preserved.
Figure 2: The figure demonstrates the concept of ununlearning. Here, the model at first possesses impermissible bomb-making knowledge. The defender uses exact unlearning to remove all instances of usage of the term bomb, making the model incapable of providing bomb recipes, since it no longer knows what that term describes. The adversary then uses the knowledge still available to the model to describe the concept in context, and as a result the model provides a response.
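
The mechanism described in the two captions can be illustrated with a small toy, using the benign Cat example from Figure 1. This is only a sketch under stated assumptions, not the paper's formalism or an actual LLM: the `ToyModel` class, the axiom names, and the `content_filter` helper are all hypothetical. Axioms are primitive facts, theorems are named concepts defined over axioms, exact unlearning deletes a theorem, and an in-context definition built from the surviving axioms restores it.

```python
from dataclasses import dataclass

# Hypothetical axioms (given facts) for the toy example.
AXIOMS = {"has_fur", "has_whiskers", "meows", "barks", "four_legs"}


@dataclass
class ToyModel:
    axioms: set
    theorems: dict  # concept name -> frozenset of axioms defining it

    def unlearn(self, name):
        """Toy stand-in for exact unlearning: a model that never stored `name`."""
        remaining = {k: v for k, v in self.theorems.items() if k != name}
        return ToyModel(set(self.axioms), remaining)

    def describe(self, name, context=None):
        """Answer 'what is <name>?' from stored theorems plus in-context definitions."""
        definitions = dict(self.theorems)
        for ctx_name, ctx_axioms in (context or {}).items():
            # An in-context definition is usable only if it is built
            # from axioms the model still holds.
            if set(ctx_axioms) <= self.axioms:
                definitions[ctx_name] = frozenset(ctx_axioms)
        if name not in definitions:
            return None
        return sorted(definitions[name])


def content_filter(answer, forbidden):
    """Block an answer whose content matches a forbidden concept's definition."""
    if answer is not None and frozenset(answer) in forbidden:
        return None
    return answer


cat = frozenset({"has_fur", "has_whiskers", "meows", "four_legs"})
dog = frozenset({"has_fur", "barks", "four_legs"})
model = ToyModel(AXIOMS, {"cat": cat, "dog": dog})

unlearned = model.unlearn("cat")
prompt = {"cat": ["has_fur", "has_whiskers", "meows", "four_legs"]}

original_answer = model.describe("cat")
forgotten_answer = unlearned.describe("cat")             # concept is gone
revived_answer = unlearned.describe("cat", context=prompt)  # reintroduced in context
filtered_answer = content_filter(revived_answer, forbidden={cat})

print(original_answer)
print(forgotten_answer)
print(revived_answer)
print(filtered_answer)
```

In this toy, the unlearned model with the prompt answers exactly as the original model does, which mirrors the relation $M(\hat{X}) \approx \hat{M}(\text{prompt}+\hat{X})$ from the TL;DR. Only the filter, which inspects the content of the answer rather than what the model stores, blocks the revived response.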
