LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu

TL;DR

This work reveals a strong coreset effect in LLM unlearning, showing that using as little as 5% of the forget data can match full-set performance for unlearning across RMU and NPO on benchmarks like WMDP and MUSE, provided training is sufficiently extended. By combining empirical results with a keyword-based analysis, the authors demonstrate that high-impact tokens largely govern unlearning, and that keyword-only subsets can reproduce much of the effect, implying data redundancy in forget sets. The study further confirms faithfulness through mode-connectivity analyses and examines robustness to jailbreaking and downstream relearning, finding generally preserved utility but potential vulnerabilities in low-data regimes. These findings challenge current benchmark designs and motivate developing optimized coresets and richer evaluation protocols to better reflect realistic unlearning challenges.

Abstract

Large language model (LLM) unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong regardless of the LLM unlearning method used, including NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the two popular methods in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Code is available at https://github.com/OPTML-Group/MU-Coreset.
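To make the random selection step concrete, the following is a minimal sketch of drawing a coreset at a given selection ratio (e.g., 5%) from a forget set. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_random_coreset`, the `seed` parameter, and the toy list of document identifiers are all hypothetical, and the forget set is assumed to be an in-memory sequence of samples.

```python
import random


def select_random_coreset(forget_set, ratio, seed=0):
    """Return a uniformly random subset holding about `ratio` of the forget set.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    samples = list(forget_set)
    if not samples:
        raise ValueError("forget_set must not be empty")
    # Keep at least one sample so that tiny ratios never yield an empty coreset.
    k = max(1, round(ratio * len(samples)))
    # A local RNG keeps each trial reproducible without touching global state.
    rng = random.Random(seed)
    return rng.sample(samples, k)


# Toy forget set of 1000 placeholder document identifiers (an assumption).
forget_set = [f"doc_{i}" for i in range(1000)]
coreset = select_random_coreset(forget_set, ratio=0.05, seed=0)
```

Varying `seed` would give the independent random trials over which performance can be averaged; the unlearning method itself (RMU or NPO) would then be run on `coreset` in place of the full forget set.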

Paper Structure

This paper contains 18 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Unveiling the "coreset"-like effect in LLM unlearning on WMDP using the RMU method, applied to the pre-trained LLM Zephyr-7B-$\beta$. The "coreset" is randomly sampled from the full forget set ($\mathcal{D}_\mathrm{f} =$ WMDP-Bio), with selection ratios of 1%, 5%, and 10%; the 100% setting corresponds to using the entire $\mathcal{D}_\mathrm{f}$. Unlearning performance is averaged over 5 random trials. (a) The "coreset" achieves comparable UE (unlearning effectiveness) for RMU on WMDP-Bio, especially when unlearning is performed with longer training epochs. The default number of unlearning epochs is 1 for RMU under the full $\mathcal{D}_\mathrm{f}$ (100%), as indicated by the shaded region. (b) MMLU-based UT (utility) of the unlearned model against the forget set selection ratio. Each box plot represents the UT performance of the unlearned model across the range of unlearning epochs shown in (a) for 5 random trials.
  • Figure 2: Consistent Random-based coreset unlearning performance, in terms of UT and UE, against the coreset selection ratio. The performance is averaged over 5 independent trials for random coreset selection, with variance indicated by the shaded regions. (a)-(d) correspond to the results of applying a specific unlearning method (RMU or NPO) to a benchmark dataset (WMDP-Bio, WMDP-Cyber, MUSE-Books, or MUSE-News). Following the benchmark setting, unlearning is performed using Zephyr-7B-$\beta$ on WMDP, LLaMA2-7B on MUSE-News, and ICLM-7B on MUSE-Books.
  • Figure 3: Unlearning performance (UE and UT) using the original coreset and its keyword subset across varying coreset selection ratios for RMU-based unlearning on (WMDP-Bio, Zephyr-7B-$\beta$).
  • Figure 4: LMC (linear mode connectivity) holds between the coreset-unlearned model (${\boldsymbol{\theta}}_\mathrm{cu}$) and the full forget set-unlearned model (${\boldsymbol{\theta}}_\mathrm{fu}$), as evidenced by UE against the interpolation coefficient $\alpha$ (x-axis). Here the coreset-unlearned models are obtained using Random-based coresets with the same setting as in Fig. 2(a)-(d).
  • Figure 5: Unlearning performance (UE) of Random-coreset unlearned models (using NPO under Zephyr-7B-$\beta$) against the number of fine-tuning samples. (a)-(f) Relearning using fine-tuning datasets (GSM8k, AGNews) for models unlearned on WMDP-Bio or WMDP-Cyber. The performance is averaged over 3 independent trials.
  • ...and 2 more figures
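
The mode-connectivity check described in the Figure 4 caption can be sketched as a sweep over the interpolation coefficient $\alpha$ between the two unlearned models. This is a minimal illustration, not the authors' code: parameters are represented as plain dictionaries of float lists, the names `interpolate_params` and `lmc_curve` are hypothetical, the `evaluate` callable stands in for whatever UE metric is used, and the convention that $\alpha = 0$ yields ${\boldsymbol{\theta}}_\mathrm{cu}$ and $\alpha = 1$ yields ${\boldsymbol{\theta}}_\mathrm{fu}$ is an assumption that may differ from the paper.

```python
def interpolate_params(theta_cu, theta_fu, alpha):
    """Linearly interpolate two parameter sets: (1 - alpha) * cu + alpha * fu.

    Assumed convention: alpha = 0 returns theta_cu, alpha = 1 returns theta_fu.
    """
    if theta_cu.keys() != theta_fu.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1 - alpha) * a + alpha * b
               for a, b in zip(theta_cu[name], theta_fu[name])]
        for name in theta_cu
    }


def lmc_curve(theta_cu, theta_fu, evaluate, num_points=11):
    """Evaluate a metric at evenly spaced points on the linear path."""
    curve = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        params = interpolate_params(theta_cu, theta_fu, alpha)
        curve.append((alpha, evaluate(params)))
    return curve


# Toy parameters and a toy metric (both assumptions, for illustration only).
theta_cu = {"w": [1.0, 2.0]}
theta_fu = {"w": [3.0, 4.0]}
curve = lmc_curve(theta_cu, theta_fu, evaluate=lambda p: sum(p["w"]))
```

In a real experiment, `evaluate` would load the interpolated weights into the model and measure UE on the benchmark; a curve that stays close to the endpoint values, without a pronounced barrier in between, is what linear mode connectivity looks like.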