Position: LLM Unlearning Benchmarks are Weak Measures of Progress

Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith

TL;DR

The paper scrutinizes the reliability of current LLM unlearning benchmarks, showing that small perturbations to a benchmark can surface supposedly unlearned information or degrade retained knowledge far more than the reported results suggest. It analyzes forget/retain evaluation design, threat models, and brittle benchmark practices, demonstrating that dependencies between forget and retain data, together with overfitting to the test queries, can make progress look larger than it is. Through targeted experiments on TOFU and WMDP, the authors show that simple modifications decouple benchmark performance from practical unlearning robustness. They advocate for benchmarks that encourage generalization, explicit threat models, and formal definitions, and urge the community to pursue provable guarantees and privacy-by-construction approaches for reliable unlearning in LLMs.

Abstract

Unlearning methods have the potential to improve the privacy and safety of large language models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning research community has increasingly turned toward empirical benchmarks to assess the effectiveness of such methods. In this paper, we find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. By introducing simple, benign modifications to a number of popular benchmarks, we expose instances where supposedly unlearned information remains accessible, or where the unlearning process has degraded the model's performance on retained information to a much greater extent than indicated by the original benchmark. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information. Further, we show that ambiguity in unlearning targets in existing benchmarks can easily lead to the design of methods that overfit to the given test queries. Based on our findings, we urge the community to be cautious when interpreting benchmark results as reliable measures of progress, and we provide several recommendations to guide future LLM unlearning research.

Paper Structure (26 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: We survey works on LLM unlearning published in 2024 and find that they predominantly evaluate unlearning success on fixed "forget set" benchmarks [liu2024awesome]. (a) Notably, the five most commonly used benchmarks (red) account for nearly half of all evaluations, with TOFU [maini2024tofu] and WMDP [li2024wmdp] alone used in 31% of papers. (b) Additionally, papers evaluating on these common benchmarks (red) receive over 80% of total citations across the repository.
  • Figure 2: Anatomy of an LLM unlearning benchmark. LLM unlearning benchmarks typically consist of a set of forget and retain queries used to evaluate whether unlearning has been effective (3). Benchmarks may also optionally include (1) a base model for learning and task-specific data which will subsequently be unlearned; (2) forget and retain data to use for the process of unlearning; and (4) a specific set of metrics to be used for evaluation on the test queries. As we discuss, it is challenging to enumerate a completely representative set of forget and retain queries to be used for evaluation. Coupled with the fact that (1), (3), and (4) are often ill-specified in existing benchmarks, this makes it easy to design unlearning approaches that overfit to these queries, and difficult to rely on benchmark results alone when assessing translational progress in this space.
  • Figure 3: ROUGE-L precision, recall, and F1 scores for three unlearning algorithms: gradient ascent, preference optimization, and ECO. (DPO refers to the preference optimization baseline measured in the TOFU paper [maini2024tofu].) The ROUGE score is computed on retain set questions with respect to the correct (non-unlearned) answer, and higher is better. 'Retain only' refers to the score if only the retain set query is asked, and 'retain split' refers to the score when a retain query is paired with a forget query, but the ROUGE score is computed only on the response to the retain query. DPO and ECO suffer significantly when the retain and forget queries are asked together, even though the original retain performance is relatively high, while gradient ascent is more stable but its overall initial F1 score is lower.
  • Figure 4: Retain set performance of base Zephyr-7B model, RMU unlearning, and RMU unlearning with LAT robustness. Replacing one random, incorrect answer choice with a phrase associated with the forget data destroys the performance of unlearned models on retain queries, even though the correct answer is present and unchanged.
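
The two perturbations described in the captions of Figures 3 and 4 can be sketched in a few lines. This is an illustrative sketch only, not the authors' evaluation code. It assumes whitespace tokenization for ROUGE-L (standard implementations also apply stemming and other normalization), and the function names `rouge_l`, `pair_queries`, and `swap_in_forget_phrase`, the query-joining format, and all placeholder strings are hypothetical.

```python
import random
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> Tuple[float, float, float]:
    """ROUGE-L (precision, recall, F1) over lowercase whitespace tokens.

    Assumption: plain whitespace tokenization, no stemming.
    """
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    precision, recall = lcs / len(cand), lcs / len(ref)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def pair_queries(retain_query: str, forget_query: str) -> str:
    """Figure 3 style: ask a retain query together with a forget query.

    Assumption: the caption does not specify how the two are joined;
    simple concatenation is used here for illustration.
    """
    return f"{retain_query} {forget_query}"


def swap_in_forget_phrase(
    choices: List[str], correct_idx: int, forget_phrase: str, rng: random.Random
) -> List[str]:
    """Figure 4 style: replace one random incorrect choice with a phrase
    associated with the forget data, leaving the correct answer unchanged."""
    wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
    perturbed = list(choices)
    perturbed[rng.choice(wrong_indices)] = forget_phrase
    return perturbed


if __name__ == "__main__":
    # Score only the retain-query portion of a (hypothetical) model response.
    print(rouge_l("the cat sat on the mat", "the cat sat"))
    print(pair_queries("<retain question>", "<forget question>"))
    print(swap_in_forget_phrase(
        ["choice A", "choice B", "choice C", "choice D"],
        correct_idx=2,
        forget_phrase="<forget-associated phrase>",
        rng=random.Random(0),
    ))
```

In this sketch, 'retain only' versus 'retain split' would correspond to calling `rouge_l` on the model's answer to the retain query alone versus its answer to the retain portion of a paired prompt. Passing a seeded `random.Random` keeps the choice swap reproducible across runs.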