LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data

Changsheng Wang, Yihua Zhang, Dennis Wei, Jinghan Jia, Pin-Yu Chen, Sijia Liu

TL;DR

Large language models memorize sensitive data, raising privacy and safety concerns, which motivates unlearning of undesirable content. This paper investigates LLM unlearning when the forget data is perturbed during training, introducing masked, rewritten, and watermarked forget sets. It benchmarks two state-of-the-art unlearning methods, NPO and RMU, on WMDP and MUSE to assess robustness and utility. A saliency-based interpretation shows that core semantic cues governing forgetting are preserved across perturbations, explaining resilience to surface-level changes. The results support a data-centric view of unlearning and highlight practical implications for real-world deployment under imperfect forget data.

Abstract

Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data is often low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.

Paper Structure

This paper contains 31 sections, 10 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illustrative examples of noisy forget data used during LLM unlearning training (left), and the performance (unlearning efficacy and utility) of the unlearned model evaluated on clean test data (right). (Left) Different perturbation types applied to forget data during unlearning training. These include: Mask, where partial or missing content is simulated (masked tokens are indicated by *); Rewrite, where LLMs are prompted to generate semantically equivalent variants; and Watermark, where identifiable signals are embedded while preserving semantic meaning (tokens containing watermark signals are highlighted in red). (Right) Performance evaluation of two representative unlearning methods, NPO [zhang2024negative] and RMU [li2024wmdp], applied to the Zephyr-7b-beta model on the WMDP dataset [li2024wmdp]. The forget data used for unlearning contains different types of perturbations (Mask, Rewrite, Watermark). Unlearn Efficacy is reflected by the WMDP evaluation accuracy, where lower values indicate better unlearning performance. General Utility reflects MMLU accuracy, where higher values indicate better retention of general model utility. Compared with unlearning on the original forget data format, different perturbation types have minimal impact on unlearning performance.
  • Figure 2: Impact of masking ratio on unlearning performance across two representative unlearning methods, NPO and RMU, applied to the Zephyr-7b-beta model on the WMDP dataset [li2024wmdp], where the masking ratio ($\delta$) varies from 0% to 90%. Here, 0% corresponds to the original, unmasked forget data. The unlearning performance is measured by Unlearn Efficacy and General Utility as shown in Fig. 1.
  • Figure 3: Consistency of unlearning error rates under perturbed forget data. (a) Venn diagram showing the overlap in incorrectly answered WMDP questions between models unlearned with original and rewritten forget data. (b) Overlap ratios between the error sets of models unlearned with various perturbed forget sets, including Mask, Rewrite, WM(KGW), and WM(SynthID), and the baseline model trained with the original forget data.
  • Figure 4: Comparison of full data and salient token unlearning performance across different forget data types. This figure presents the unlearning efficacy of RMU on the WMDP dataset across four forget data types: Original Data and three perturbed variants (Mask, Rewrite, and WM(KGW)). For each type, two variants are evaluated: using the entire forget set (Full Data) and using only LLM-as-judge-selected salient tokens (Salient Tokens). Results demonstrate that unlearning with salient tokens achieves efficacy comparable to Full Data unlearning across all settings, highlighting that a small, targeted subset of tokens is sufficient for effective unlearning when guided by LLM-based saliency selection.
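
Two quantities in the captions above can be made concrete with a short sketch: the masking ratio $\delta$ (Figures 1 and 2) and the overlap between error sets (Figure 3). This is an illustration under stated assumptions, not the authors' implementation. The captions do not say how masked positions are chosen or how the overlap ratio is normalized, so the sketch assumes uniform random masking over whitespace-separated tokens and normalization by the size of the baseline error set. The function names `mask_tokens` and `overlap_ratio` are hypothetical; only the `*` mask symbol comes from the Figure 1 caption.

```python
import random


def mask_tokens(text, ratio, mask_symbol="*", seed=0):
    """Replace a fraction `ratio` of whitespace-separated tokens with `mask_symbol`.

    Assumption: positions are sampled uniformly at random; the captions do not
    specify the selection rule or the tokenization used in the paper.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    tokens = text.split()
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n_mask = round(ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return " ".join(
        mask_symbol if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    )


def overlap_ratio(perturbed_errors, baseline_errors):
    """Fraction of baseline errors that the perturbed-data model also gets wrong.

    Assumption: normalization by the baseline error set; the paper may use a
    different denominator (for example, the union of both sets).
    """
    baseline = set(baseline_errors)
    if not baseline:
        return 0.0
    return len(set(perturbed_errors) & baseline) / len(baseline)
```

For example, `mask_tokens(sample, 0.5)` on a ten-token string replaces five tokens with `*`, and `overlap_ratio({1, 2, 3}, {2, 3, 4, 5})` gives 0.5 because two of the four baseline errors are shared.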