A Survey on Unlearning in Large Language Models

Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun

TL;DR

The paper surveys unlearning in large language models, addressing privacy, copyright, and safety risks from memorized data. It articulates a phase-based taxonomy spanning training-time, post-training, and inference-time interventions, further distinguishing parameter modification from parameter selection, with the goal that the unlearned model $\mathcal{M}_u$ approximates the retrained model $\mathcal{M}_r$ trained on the retain set $\mathcal{D}_r$. It provides a multidimensional evaluation framework, covering 18 benchmarks and decomposing knowledge-memorization metrics into 10 categories alongside model utility, robustness, and efficiency measures. It also discusses challenges in definition, multilinguality, real-world deployment, and verifiable unlearning, and outlines future directions including specialized architectures, tool-enabled unlearning, robust verification, and scalable deployment.
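
The goal stated above can be written compactly. This is only a sketch under assumed notation: the training algorithm $\mathcal{A}$, the unlearning algorithm $\mathcal{U}$, the original model $\mathcal{M}$, and the full training set $\mathcal{D}$ are symbols introduced here for illustration, and the disjoint split of $\mathcal{D}$ is likewise an assumption. Only $\mathcal{M}_u$, $\mathcal{M}_r$, $\mathcal{D}_u$, and $\mathcal{D}_r$ come from the paper's own summary and figure captions.

```latex
\begin{align*}
\mathcal{D}   &= \mathcal{D}_u \cup \mathcal{D}_r, \qquad \mathcal{D}_u \cap \mathcal{D}_r = \emptyset
  && \text{(unlearn set and retain set)} \\
\mathcal{M}   &= \mathcal{A}(\mathcal{D}), \qquad \mathcal{M}_r = \mathcal{A}(\mathcal{D}_r)
  && \text{(original and retrained models)} \\
\mathcal{M}_u &= \mathcal{U}(\mathcal{M}, \mathcal{D}_u) \approx \mathcal{M}_r
  && \text{(unlearned model approximates retraining)}
\end{align*}
```

Here $\approx$ is left informal on purpose; how closeness to $\mathcal{M}_r$ is measured is exactly what the evaluation metrics mentioned above are meant to capture.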

Abstract

Large Language Models (LLMs) demonstrate remarkable capabilities, but their training on massive corpora poses significant risks from memorized sensitive information. To mitigate these risks and align with legal standards, unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021. First, it introduces a novel taxonomy that categorizes unlearning methods by the phase of the LLM pipeline at which the intervention occurs. This framework further distinguishes between parameter modification and parameter selection strategies, enabling deeper insights and more informed comparative analysis. Second, it offers a multidimensional analysis of evaluation paradigms. For datasets, it compares 18 existing benchmarks in terms of task format, content, and experimental paradigm to offer actionable guidance. For metrics, it moves beyond mere enumeration by dividing knowledge memorization metrics into 10 categories and analyzing their advantages and applicability, while also reviewing metrics for model utility, robustness, and efficiency. By discussing current challenges and future directions, this survey aims to advance the field of LLM unlearning and the development of secure AI systems.

Paper Structure

This paper contains 48 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of an unlearning process. The box below each model represents the composition of the corresponding training set. The unlearn set $\mathcal{D}_u$ is represented by the shaded square and the retain set $\mathcal{D}_r$ by the white square. An unlearning algorithm is applied to the initial target model to obtain the unlearned model $\mathcal{M}_u$, which is expected to approximate the retrained model $\mathcal{M}_r$.
  • Figure 2: Examples of different requests. We extract some fragments from the unlearn set of the corresponding work. At the entity level, in addition to the entity to be unlearned, we also show generated samples for this entity, illustrating how entity-level unlearning is converted to sample-level unlearning.
  • Figure 3: Framework of unlearning methods. In typical LLM usage scenarios, a model is first trained on specific datasets and is then used for inference to generate outputs. The unlearning method can be applied to the training process, the trained model, or the inference stage, corresponding to training-time unlearning (Section \ref{sec:training-time}), post-training unlearning (Section \ref{sec:post-training}), and inference-time unlearning (Section \ref{sec:infer-time}).
  • Figure 4: Objective designs of unlearning methods. The color coding is as follows: blue for text, red for tensors/vectors, orange for loss functions. Text-based and distribution-based methods compute a loss function at the output layer by comparing the output to a reference (ref.) at the textual and distributional level, respectively. Activation-based methods compute the loss by comparing activations from the hidden layers against a reference.
  • Figure 5: Illustration of three different approaches to incorporating new structures. The blue parts denote frozen parameters and the red parts denote parameters available for fine-tuning.
  • ...and 2 more figures