Digital Forgetting in Large Language Models: A Survey of Unlearning Methods

Alberto Blanco-Justicia, Najeeb Jebreel, Benet Manzanares, David Sánchez, Josep Domingo-Ferrer, Guillem Collell, Kuan Eeik Tan

TL;DR

This survey reviews unlearning methods for digital forgetting in large language models and organizes them into a four-way taxonomy: global weight modification, local weight modification, architecture modification, and input/output modification. It analyzes forgetting guarantees, model utility retention, and computational efficiency, covering approaches that range from data-sharded retraining to prompt-based and memory-editing strategies. The work compiles the datasets, models, metrics, and attacks used to evaluate forgetting and retention, and outlines open challenges such as achieving provable guarantees and scalable real-world deployment. Its contribution is to clarify the landscape of unlearning methods, identify gaps, and guide future research toward benchmark-driven, scalable, and robust forgetting in LLMs. The findings underscore the trade-off between exact forgetting guarantees and practical applicability, and call for integrated, modular approaches that balance effectiveness, utility, and efficiency.

Abstract

The objective of digital forgetting is, given a model with undesirable knowledge or behavior, to obtain a new model in which the detected issues are no longer present. The motivations for forgetting include privacy protection, copyright protection, elimination of biases and discrimination, and prevention of harmful content generation. Digital forgetting has to be effective (meaning that the new model has actually forgotten the undesired knowledge/behavior), retain the performance of the original model on the desirable tasks, and be scalable (in particular, forgetting has to be more efficient than retraining from scratch on just the tasks/data to be retained). This survey focuses on forgetting in large language models (LLMs). We first provide background on LLMs, including their components, the types of LLMs, and their usual training pipeline. Second, we describe the motivations, types, and desired properties of digital forgetting. Third, we introduce the approaches to digital forgetting in LLMs, among which unlearning methodologies stand out as the state of the art. Fourth, we provide a detailed taxonomy of machine unlearning methods for LLMs, and we survey and compare current approaches. Fifth, we detail the datasets, models, and metrics used to evaluate forgetting, retaining, and runtime. Sixth, we discuss challenges in the area. Finally, we provide some concluding remarks.

Paper Structure (50 sections, 5 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The transformer architecture. Note that, since Xiong et al. (2020), layer normalization is typically applied before the attention and the feed-forward layers rather than after the residual addition. Image source: https://github.com/negrinho/sane_tikz/blob/master/examples/transformer.tex, used under the MIT licence.
  • Figure 2: Encoder-only transformer
  • Figure 3: Decoder-only transformer
  • Figure 4: Taxonomy of unlearning methods in LLMs

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5