Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models

Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao

TL;DR

This work tackles the problem of forgetting implicit knowledge in large language models (LLMs), arguing that existing unlearning methods generalise poorly to related or paraphrased information. It introduces PerMU, a perturbation-based approach that identifies the most sensitive tokens via a model-sensitivity metric (MSM), perturbs them to alter the logit distribution, and then applies distribution subtraction to suppress fact-related tokens while preserving non-fact content. The authors formalise an expanded unlearning scope that includes paraphrases and one-hop reasoning, and evaluate 15 methods across diverse datasets (TOFU, Harry Potter, ZsRE, WMDP, MUSE) and model scales (1.3B–13B). They report substantial gains in forgetting and generalisation with PerMU (up to 50.40% improvement in forgetting target data and up to 40.73% improvement in forgetting implicit knowledge) while maintaining utility. They also offer a fast variant and extensive ablations that examine the trade-offs among perturbation level, retain losses, and tuning coefficients, supporting PerMU's robustness and practical potential for generalised implicit knowledge forgetting in LLMs.

Abstract

In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation, ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, relation-reversed, and one-hop reasoned data. We then conduct a rigorous evaluation of 15 state-of-the-art methods across three datasets, revealing that unlearned models still recall paraphrased answers and retain target facts in their intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while also achieving a 40.73% boost in forgetting implicit knowledge. Our code is available at https://github.com/MaybeLizzy/PERMU.
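
The clean-run and corrupted-run idea described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the update rule (subtracting a scaled clean-minus-corrupted difference from the corrupted logits), the coefficient name `alpha`, the toy vocabulary, and the logit values are all assumptions made for illustration. It only shows how subtracting the fact-related component can push down answer tokens while leaving tokens that the perturbation did not affect untouched.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def subtract_fact_component(clean_logits, corrupted_logits, alpha=1.0):
    """Hypothetical distribution subtraction (an assumption, not the paper's formula).

    The difference (clean - corrupted) is treated as the fact-related
    component; alpha times that difference is removed from the corrupted logits.
    """
    return [c - alpha * (p - c) for p, c in zip(clean_logits, corrupted_logits)]

# Toy vocabulary and logits, invented for illustration only.
vocab = ["Paris", "London", "the", "a"]
clean_logits = [6.0, 2.0, 1.0, 0.5]      # clean run: the fact token dominates
corrupted_logits = [1.5, 1.4, 1.0, 0.5]  # corrupted run: perturbed input, fact not recalled

target_logits = subtract_fact_component(clean_logits, corrupted_logits, alpha=1.0)
clean_probs = softmax(clean_logits)
target_probs = softmax(target_logits)

for token, p_clean, p_target in zip(vocab, clean_probs, target_probs):
    print(f"{token:>6}: clean={p_clean:.3f} target={p_target:.3f}")
```

In this toy setup the logits of "the" and "a" are identical in both runs, so the subtraction leaves them unchanged, while the logit of the answer token drops well below its corrupted-run value. A distribution of this kind could then serve as a training target for the unlearned model, although how the paper uses it is not specified in this summary.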

Paper Structure

This paper contains 27 sections, 1 theorem, 2 equations, 6 figures, 18 tables, 1 algorithm.

Key Result

Proposition 5.1

For fixed $\Delta_i$, $\mathcal{J}(x_i, \hat{x}_i) \propto \lambda_i$, where $\lambda_i$ is the maximum eigenvalue of $\mathbf{H}(x_i)$, for $i\in\{1, 2, \dots, m\}$.
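
This summary does not define $\mathcal{J}$, $\Delta_i$, or $\mathbf{H}$, so the following is only a numerical illustration of a standard linear-algebra fact that makes a statement of this shape plausible. It assumes that $\mathbf{H}$ is a symmetric positive semi-definite (Hessian-like) matrix and that the quantity of interest behaves like a quadratic form $\delta^\top \mathbf{H} \delta$ in a perturbation $\delta$ of fixed norm. Under those assumptions, the largest attainable value is $\lambda_{\max}\lVert\delta\rVert^2$, which scales linearly with the top eigenvalue. The matrix `H`, the helper names, and `radius` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(H, v):
    return [dot(row, v) for row in H]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def top_eigenpair(H, iters=200):
    """Power iteration; assumes H is symmetric positive semi-definite."""
    v = normalize([1.0] * len(H))
    for _ in range(iters):
        v = normalize(matvec(H, v))
    return dot(v, matvec(H, v)), v

def quad_form(H, delta):
    """delta^T H delta, the second-order effect of a perturbation delta."""
    return dot(delta, matvec(H, delta))

# Hypothetical 2x2 Hessian-like matrix with eigenvalues 3 - sqrt(2) and 3 + sqrt(2).
H = [[4.0, 1.0], [1.0, 2.0]]
lam_max, v_max = top_eigenpair(H)

radius = 0.1  # fixed perturbation norm, playing the role of a fixed Delta_i
worst = quad_form(H, [radius * x for x in v_max])  # perturb along the top eigenvector
other = quad_form(H, [radius, 0.0])                # perturb along a coordinate axis

print(f"lambda_max = {lam_max:.6f}")
print(f"worst-case quadratic form = {worst:.6f}")
print(f"axis-aligned quadratic form = {other:.6f}")
```

With the norm held fixed, the perturbation aligned with the top eigenvector yields exactly `radius**2 * lam_max`, and any other direction of the same norm yields no more than that.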

Figures (6)

  • Figure 1: Depiction of the proposed unlearning scope in a hypothetical semantic embedding space, highlighting the generalisation dilemma inherent in machine unlearning for LLMs. Ideally, hard in-scope samples, which lie within the unlearning scope by only a small margin, should also be forgotten. These include rephrased questions, relation-reversed questions, and similar variants.
  • Figure 2: The ranking of the first key token for the correct answer in the next-token probability distribution rises rapidly in the mid-layers of the unlearned model fine-tuned with Gradient Ascent.
  • Figure 3: Depiction of PerMU. Left: The clean run involves inputting the original unlearning sample into the model, enabling it to successfully recall the facts and generate a fact-related probability distribution. Right: The corrupted run refers to inputting a perturbed unlearning sample into the model, making it fail to recall the facts and produce a fact-unrelated probability distribution, where the ground truth ranks significantly lower in the distribution.
  • Figure 4: Curves showing how Model Utility changes with the Forget Ratio. The closer a method is to the upper left corner, the better it balances model utility and the forgetting effect. The PerMU curve envelops nearly all baseline methods from the upper left, demonstrating superior unlearning performance.
  • Figure 5: The ranking of the first answer token in the next-token probability distribution across layers of the unlearned model. PerMU consistently achieves lower ranks across all layers, indicating that the model fails to recall facts.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Proposition 5.1