LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data
Changsheng Wang, Yihua Zhang, Dennis Wei, Jinghan Jia, Pin-Yu Chen, Sijia Liu
TL;DR
Large language models memorize sensitive data, which raises privacy and safety concerns and motivates unlearning of undesirable content. This paper investigates LLM unlearning when the forget data used for unlearning is perturbed, introducing masked, rewritten, and watermarked forget sets. It benchmarks two state-of-the-art unlearning methods, NPO and RMU, on WMDP and MUSE to assess both robustness of forgetting and retained utility. A saliency-based interpretation shows that the core semantic cues governing forgetting are preserved across perturbations, which explains the resilience to surface-level changes. The results support a data-centric view of unlearning and have practical implications for real-world deployment, where forget data is often imperfect.
Abstract
Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data is often low-quality, synthetically rewritten, or watermarked, which casts doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking two state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: the key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are guided primarily by deep semantic cues rather than shallow lexical patterns.
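To make the notion of an incomplete (masked) forget set concrete, the following is a minimal sketch of how such a perturbation could be constructed. It is not the authors' procedure: the function names, the whitespace tokenization, the `[MASK]` placeholder, and the default mask ratio are all assumptions chosen for illustration.

```python
import random


def mask_forget_sample(text, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Replace a random fraction of whitespace-separated tokens with mask_token.

    Hypothetical helper: whitespace tokenization and the mask token are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    tokens = text.split()
    n_mask = int(len(tokens) * mask_ratio)
    # Choose distinct positions to mask so exactly n_mask tokens are replaced.
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return " ".join(
        mask_token if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    )


def make_noisy_forget_set(forget_set, mask_ratio=0.3, seed=0):
    """Apply masking to every sample, using a different seed per sample."""
    return [
        mask_forget_sample(text, mask_ratio=mask_ratio, seed=seed + i)
        for i, text in enumerate(forget_set)
    ]
```

Under these assumptions, an unlearning method such as NPO or RMU would then be run on the output of `make_noisy_forget_set` instead of the clean forget set, and the forgetting and utility metrics of the two runs would be compared.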
