Towards Effective Evaluations and Comparisons for LLM Unlearning Methods

Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, Masashi Sugiyama

TL;DR

This work tackles the core problem of evaluating and comparing LLM unlearning methods in a robust and fair manner. It introduces Unlearning with Control (UWC), combining Extraction Strength ($ES$) as the primary unlearning metric with a calibration mechanism based on Model Mixing (MM) of parameters, formalized as $(1-\alpha)\boldsymbol{\theta}_{\mathrm{ref}}+\alpha\boldsymbol{\theta}$, to decouple removal strength from retention. The framework enables reliable cross-method comparisons, hyper-parameter guidance, and identification of practical tricks (e.g., Temperature Scaling) that enhance unlearning efficacy without compromising non-targeted knowledge. Through TOFU benchmark experiments on two LLMs, the authors show that proper hyper-parameter tuning can elevate GA-based methods, that GD/KL with retention terms mitigate collapse, and that MM facilitates fair assessment of removal strength. Overall, UWC offers a pragmatic roadmap for evaluating unlearning in real-world LLMs, with broad implications for safety, privacy, and policy compliance in AI systems.
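
The mixing rule $(1-\alpha)\boldsymbol{\theta}_{\mathrm{ref}}+\alpha\boldsymbol{\theta}$ can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `mix_parameters`, the representation of parameters as a dictionary of flat float lists, and the restriction of $\alpha$ to $[0, 1]$ are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def mix_parameters(theta_ref: Params, theta: Params, alpha: float) -> Params:
    """Return (1 - alpha) * theta_ref + alpha * theta, entry by entry.

    alpha = 0 gives back the reference model, alpha = 1 the unlearned model.
    Restricting alpha to [0, 1] is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_ref.keys() != theta.keys():
        raise ValueError("both models must share the same parameter names")

    mixed: Params = {}
    for name, ref_values in theta_ref.items():
        new_values = theta[name]
        if len(ref_values) != len(new_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Linear interpolation between reference and unlearned weights.
        mixed[name] = [
            (1.0 - alpha) * r + alpha * u
            for r, u in zip(ref_values, new_values)
        ]
    return mixed
```

With real checkpoints the same interpolation would be applied tensor by tensor; the list-based version here only keeps the example dependency-free.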

Abstract

The imperative to eliminate undesirable data memorization underscores the significance of machine unlearning for large language models (LLMs). Recent research has introduced a series of promising unlearning methods, notably boosting the practical significance of the field. Nevertheless, adopting a proper evaluation framework that reflects the true unlearning efficacy is equally essential, yet it has not received adequate attention. This paper seeks to refine the evaluation of LLM unlearning by addressing two key challenges: (a) the robustness of evaluation metrics and (b) the trade-offs between competing goals. The first challenge stems from findings that current metrics are susceptible to various red teaming scenarios. This indicates that they may not reflect the true extent of knowledge retained by LLMs but instead tend to mirror superficial model behaviors, and are thus prone to attacks. We address this issue by devising and assessing a series of candidate metrics, selecting the most robust ones under various types of attacks. The second challenge arises from the conflicting goals of eliminating unwanted knowledge while retaining other knowledge. This trade-off between unlearning and retention often fails to conform to the Pareto frontier, making it difficult to compare methods that excel in only one of unlearning or retention. We handle this issue by proposing a calibration method that can restore the original performance on non-targeted data after unlearning, thereby allowing us to focus exclusively on assessing the strength of unlearning. Our evaluation framework notably improves the assessment and comparison of various LLM unlearning methods, further allowing us to benchmark existing works, identify their proper hyper-parameters, and explore new tricks to enhance their practical efficacy.
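
One way to picture the calibration step is as a one-dimensional search over a control value $\alpha$ that keeps the retention score at a chosen target. The sketch below assumes a hypothetical callable `retain_score(alpha)` that evaluates the calibrated model on non-targeted data, and assumes this score does not increase as $\alpha$ grows. The bisection procedure, the function name `calibrate_alpha`, and the tolerance are illustrative choices, not the paper's stated algorithm.

```python
from typing import Callable


def calibrate_alpha(
    retain_score: Callable[[float], float],
    target: float,
    tol: float = 1e-3,
) -> float:
    """Find (up to tol) the largest alpha in [0, 1] with retain_score(alpha) >= target.

    Assumes retain_score is non-increasing in alpha: alpha = 0 is the
    reference model and alpha = 1 is the fully unlearned model.
    """
    if retain_score(1.0) >= target:
        return 1.0  # retention is already preserved without calibration
    if retain_score(0.0) < target:
        raise ValueError("even the reference model misses the retention target")

    # Invariant: retain_score(lo) >= target > retain_score(hi).
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if retain_score(mid) >= target:
            lo = mid
        else:
            hi = mid
    return lo
```

Once $\alpha$ is fixed this way for each method, the methods share roughly the same retention level, so the remaining difference in removal scores can be compared directly.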

Paper Structure (17 sections, 15 equations, 7 figures, 16 tables, 1 algorithm)

Figures (7)

  • Figure 1: For effective unlearning, it is preferable to have large ES scores for retention (x-axis) yet small ES scores for removal (y-axis). For the raw results (orange), we observe that GA excels at removal whereas NPO is better at retention, making it hard to determine which method is better overall. UWC resolves this challenge by aligning ES scores for retention, allowing us to focus on comparing the ES scores for unlearning (blue). This leads to the conclusion that NPO is superior overall.
  • Figure 2: Metric Robustness under Red Teaming Attacks. We depict the metric scores before (x-axis) and after (y-axis) attacks jointly for different unlearning setups: across 2 LLMs (Phi-1.5 and Llama-2-7B), 3 unlearning percentages (1%, 5%, and 10%), and 4 unlearning methods (GA, GD, PO, and NPO). We consider 3 representative metrics under 4 red teaming behaviors. We apply the log-scale for PPL to avoid numeric errors. For each of these scenarios, we compute the PPC with respect to targeted and non-targeted data respectively, displayed at the top of each figure (targeted data / non-targeted data). We provide linear fits for targeted and non-targeted data separately, accompanied by shaded areas representing the standard deviations that visualize the PPC scores.
  • Figure 3: ES Scores with MM Control. We depict values of $\alpha$ (x-axis) versus the ES scores (y-axis) on targeted (unlearn) and non-targeted (retain) data. We consider 2 LLMs (Phi-1.5 and Llama-2-7B) and 4 unlearning methods (GA, GD, PO, and NPO) under the 5% TOFU unlearning setup.
  • Figure 4: The causal graph for the assessment of unlearning metrics. The solid / dashed arrows represent known / unknown relationships.
  • Figure 5: Robustness of Metrics under Red Teaming Attacks. We depict the metric scores before (x-axis) and after (y-axis) attacks jointly for different unlearning setups: across 2 LLMs (Phi-1.5 and Llama-2-7B), 3 unlearning percentages (1%, 5%, and 10%), and 4 unlearning methods (GA, GD, PO, and NPO). We consider 5 different metrics under 4 red teaming behaviors. We apply the log-scale for PPL to avoid numeric errors. For each of these scenarios, we compute the PPC with respect to targeted and non-targeted data respectively, displayed at the top of each figure (targeted data / non-targeted data). We provide linear fits for targeted and non-targeted data separately, accompanied by shaded areas representing the standard deviation to further visualize the PPC scores.
  • ...and 2 more figures
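
The robustness figures pair each metric score before an attack with its score after the attack and summarize the relationship with a "PPC" value. The captions do not expand the abbreviation; the sketch below assumes it denotes a Pearson-style linear correlation coefficient, and the function name `pearson` is hypothetical. Under that reading, a value near 1 means the metric ranks unlearning setups consistently with and without the attack.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between paired scores (e.g. before vs. after an attack)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```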