Probing Knowledge Holes in Unlearned LLMs

Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, Ruoxi Jia

TL;DR

This paper examines the hidden costs of machine unlearning in large language models by introducing a three-step knowledge evaluation pipeline that combines adjacent probing, RL-enhanced latent probing, and post-hoc filtering. It demonstrates that unlearning can produce substantial knowledge holes not captured by standard benchmarks, with latent holes often showing far greater quality degradation than adjacent holes. Using two forgetting datasets and multiple unlearning methods, the study shows that even when harmful content is removed, benign knowledge can be systematically forgotten, and mitigation attempts can introduce new holes. The work argues for dynamic, model-dependent evaluations and proactive mitigation strategies to balance the removal of harmful content with the preservation of benign knowledge.

Abstract

Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes": unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks.
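
The headline figure above is a rate: among test cases the pretrained model can answer, the share on which the unlearned model gives an irrelevant or nonsensical response. As a minimal sketch of how such a rate could be computed, assume each test case already carries a judge score for both models. The `ProbeCase` record, the 1-10 score scale, and both thresholds are hypothetical choices made for illustration, not the paper's actual scoring protocol.

```python
from dataclasses import dataclass


@dataclass
class ProbeCase:
    """One generated test prompt with judge scores (hypothetical 1-10 scale)."""
    prompt: str
    pretrained_score: float  # quality of the pretrained model's response
    unlearned_score: float   # quality of the unlearned model's response


def knowledge_hole_rate(cases, answerable_at=7.0, failed_at=3.0):
    """Fraction of pretrained-answerable cases that the unlearned model fails.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Keep only cases the pretrained model answers well.
    answerable = [c for c in cases if c.pretrained_score >= answerable_at]
    if not answerable:
        return 0.0
    # A "hole" is an answerable case where the unlearned model scores poorly.
    holes = [c for c in answerable if c.unlearned_score <= failed_at]
    return len(holes) / len(answerable)


cases = [
    ProbeCase("q1", pretrained_score=9.0, unlearned_score=1.0),  # hole
    ProbeCase("q2", pretrained_score=8.0, unlearned_score=9.0),  # preserved
    ProbeCase("q3", pretrained_score=2.0, unlearned_score=1.0),  # not answerable, excluded
    ProbeCase("q4", pretrained_score=9.0, unlearned_score=2.0),  # hole
]
rate = knowledge_hole_rate(cases)
print(f"knowledge hole rate: {rate:.3f}")
```

Conditioning on the pretrained model's score matters in this sketch: a prompt that neither model can answer reflects a limit of the base model, not a cost of unlearning, so it is excluded from the denominator.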

Paper Structure

This paper contains 60 sections, 3 equations, 3 figures, 28 tables.

Figures (3)

  • Figure 1: Illustration of unlearning unwanted knowledge leading to unintended forgetting of benign knowledge, creating knowledge holes. These holes exist in both adjacent knowledge (benign questions involving keywords linked to the harmful knowledge) and latent knowledge that covers broader, unrelated topics.
  • Figure 2: Diversity-based filtering results on each forgetting dataset. We follow the three steps to obtain the final $\mathbf{D}_\text{LP}$ for the PKU-SafeRLHF dataset [ji2024beavertails] and WMDP-bio.
  • Figure 3: Unlearning trade-offs across iterations. Left: MT-bench and Adjacent dataset scores, demonstrating differential utility preservation. Right: PKU-SafeRLHF dataset scores, showing the progression of harm mitigation. This trade-off occurs for all unlearning methods.