Probing Knowledge Holes in Unlearned LLMs
Myeongseob Ko, Hoang Anh Just, Charles Fleming, Ming Jin, Ruoxi Jia
TL;DR
This paper examines the hidden costs of machine unlearning in large language models. It introduces a three-step knowledge evaluation pipeline that combines adjacent probing, RL-enhanced latent probing, and post-hoc filtering. The pipeline shows that unlearning can produce substantial knowledge holes that standard benchmarks do not capture, and that latent holes often involve far greater quality degradation than adjacent holes. Using two forgetting datasets and multiple unlearning methods, the study shows that even when harmful content is removed, benign knowledge can be systematically forgotten, and that mitigation attempts can introduce new holes. The work argues for dynamic, model-dependent evaluations and proactive mitigation strategies that balance removing harmful content against preserving benign knowledge.
Abstract
Machine unlearning has emerged as a prevalent technical solution for selectively removing unwanted knowledge absorbed during pre-training, without requiring full retraining. While recent unlearning techniques can effectively remove undesirable content without severely compromising performance on standard benchmarks, we find that they may inadvertently create "knowledge holes": unintended losses of benign knowledge that standard benchmarks fail to capture. To probe where unlearned models reveal knowledge holes, we propose a test case generation framework that explores both immediate neighbors of unlearned content and broader areas of potential failures. Our evaluation demonstrates significant hidden costs of unlearning: up to 98.7% of the test cases yield irrelevant or nonsensical responses from unlearned models, despite being answerable by the pretrained model. These findings necessitate rethinking the conventional approach to evaluating knowledge preservation in unlearning, moving beyond standard, static benchmarks.
