FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge

Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung

TL;DR

This work tackles the problem of faithfully forgetting private or sensitive knowledge in language models by identifying superficial unlearning, where an unlearning method either fails to erase interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. It introduces FaithUn, a benchmark with paraphrased, multi-hop, and same-answer datasets built on a world knowledge graph, plus a formal problem definition and evaluation framework. The proposed KLUE method localizes unlearning to knowledge-relevant neurons using attribution-based scoring and a superficial-knowledge regularization, then updates only those neurons using selected unforgotten samples to avoid overfitting. Empirical results show that widely used unlearning methods exhibit superficial unlearning, while KLUE achieves more faithful unlearning, erasing interconnected knowledge while preserving unrelated knowledge in real-world QA scenarios. The work argues that unlearning research should explicitly measure knowledge interdependencies.

Abstract

Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed while retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
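
The definition of superficial unlearning above can be illustrated with a minimal sketch. This is not the benchmark's actual evaluation code: the `Probe` type, the group names, and the `superficial_failures` helper are hypothetical, and the sketch assumes that paraphrased and multi-hop questions depend on the target knowledge (and so should be forgotten), while same-answer questions only share the answer in a different context (and so should still be answered correctly).

```python
from dataclasses import dataclass


@dataclass
class Probe:
    question: str
    group: str           # "paraphrased", "multi_hop", or "same_answer" (hypothetical labels)
    correct_after: bool  # does the unlearned model still answer correctly?


# Assumption: which probe groups should be forgotten after unlearning a target fact.
SHOULD_FORGET = {"paraphrased": True, "multi_hop": True, "same_answer": False}


def superficial_failures(probes):
    """Return the probes whose outcome contradicts faithful unlearning.

    A probe fails when it should be forgotten but is still answered correctly
    (interconnected knowledge not erased), or when it should be retained but is
    no longer answered correctly (irrelevant knowledge erased).
    """
    return [p for p in probes if SHOULD_FORGET[p.group] == p.correct_after]


# Illustrative outcomes for one unlearned fact (made-up placeholders, not real data).
probes = [
    Probe("q1", "paraphrased", correct_after=False),  # forgotten, as intended
    Probe("q2", "multi_hop", correct_after=True),     # still answerable: under-forgetting
    Probe("q3", "same_answer", correct_after=False),  # lost: over-forgetting
    Probe("q4", "same_answer", correct_after=True),   # retained, as intended
]
failures = superficial_failures(probes)
print([p.question for p in failures])
```

Under these assumptions, an unlearning method is faithful for a target fact only when the list of failures is empty; either kind of failure counts as superficial unlearning.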

Paper Structure

This paper contains 50 sections, 4 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Faithful Forgetting in LLMs. FaithUn proposes three datasets to evaluate unlearning methods (i.e., Paraphrased, Multi-hop, and Same-answer datasets). Each target knowledge to be unlearned is mapped with questions from these three datasets for evaluation.
  • Figure 2: The relationship between UA and other metrics. The X-axis shows UA in descending order, and the Y-axis shows the accuracy of other metrics.
  • Figure 3: The ratio of neuron localization. We plot the accuracy of each metric for varying ratios of neurons.
  • Figure 4: The number of data instances per entity. The X-axis shows the entity index, sorted in descending order of popularity. The Y-axis shows the number of questions to be unlearned for each entity.
  • Figure 5: Relation frequency for each dataset: the Base QA dataset (left), the Multi-hop QA dataset (middle), and the Same-answer QA dataset (right).
  • ...and 5 more figures