Eight Methods to Evaluate Robust Unlearning in LLMs

Aengus Lynch; Phillip Guo; Aidan Ewart; Stephen Casper; Dylan Hadfield-Menell

TL;DR

This paper addresses the lack of standardized evaluation for LLM unlearning by surveying existing evaluation methods and applying eight tests of robustness and competitiveness to the "Who's Harry Potter" (WHP) model. Although WHP's unlearning generalizes under the Familiarity metric, knowledge can still be extracted from the model, and there is collateral unlearning in related domains. Trivia-based evaluations show that WHP performs nearly as well as the original model on Harry Potter Q&A tasks, latent probing recovers knowledge from its internal representations, and adversarial techniques such as modified or jailbreak prompts raise the amount of knowledge it reveals. Overall, the results highlight the need for adversarial, multi-faceted evaluation to reliably assess unlearning approaches and to guide the development of more robust methods for safe deployment.

Abstract

Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.

Paper Structure (16 sections, 15 figures, 1 table)

Introduction
Related Work
Tests for Robust and Competitive Unlearning
Discussion
Detailed Explanations
Familiarity Metric
Relearning through Fine-tuning
Latent Knowledge
Input Prompt Modifications
Jailbreak Prompts
Baseline Unlearning Prompts
Summaries
Downstream Tasks
Binary Answer Questions
Short Answer Questions
...and 1 more section

Figures (15)

Figure 1: The WHP model's unlearning generalizes under the Familiarity metric, but different strategies can extract more information from it both in an absolute sense and relative to the original model. "Familiarity" (y-axis) is a measure introduced in Eldan and Russinovich (2023) using GPT-4 evaluations of the correctness and relatedness of model generations to the Harry Potter universe (see the "Familiarity Metric" appendix section). The dotted lines show the Harry Potter Familiarity for the base and WHP models. Orange WHP bars are consistently lower than blue LLaMA-2 model bars, demonstrating generalization of the WHP model's unlearning. However, our tests can increase the absolute Familiarity of the WHP model above the 0.09 baseline (as shown by orange bars above the orange baseline) and the Familiarity relative to the original model (as shown by deltas smaller than the 77% baseline -- marked in red).
Figure 2: Unlike Familiarity-based evaluations, trivia-based evaluations suggest only minor differences between the WHP and original models. (Left) Trivia-based evaluations of unlearning suggest that the WHP model performs comparably to the original model. It even performs better than the original model on short-answer trivia questions. (Right) Supervised and unsupervised probes can extract knowledge from the latent representations of the WHP model similarly well to the original model. The horizontal baselines are set based on the binary question-answering ability of the models shown on the left.
Figure 3: (Left) The WHP model beats a trivial prompting baseline in which we instruct the model to behave as if it does not know about Harry Potter. (Right) The WHP model shows signs of unintended collateral unlearning in domains related to Harry Potter. Eldan and Russinovich (2023) found that the WHP model showed minimal evidence of unlearning on general knowledge but did not test knowledge of closely related domains. Here, using the same evaluation as Eldan and Russinovich (2023), we evaluate the Familiarity of the WHP model on other domains and find that in some, there are unintended Familiarity drops.
Figure 4: Example input and completions from Llama-2 and WHP: Both Llama-2 and WHP generate only 20 tokens with temperature 0.
Figure 5: Familiarity evaluation system prompt from Eldan and Russinovich (2023): GPT-4 generates a reasoning sequence before writing "MODEL FAMILIARITY: X/3", from which we extract the score. The prompt is formatted with the datapoint references, prompt, and model completion.
...and 10 more figures
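
The score-extraction step described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `extract_familiarity_score` and `mean_normalized_score` are hypothetical, and the averaging shown (mean score divided by 3) is an assumed, simplified aggregation that may differ from how Familiarity is computed in the paper. Only the "MODEL FAMILIARITY: X/3" line format comes from the caption.

```python
import re
from typing import Iterable, Optional

# Line format taken from the Figure 5 caption: "MODEL FAMILIARITY: X/3".
_SCORE_RE = re.compile(r"MODEL FAMILIARITY:\s*([0-3])\s*/\s*3")


def extract_familiarity_score(grader_output: str) -> Optional[int]:
    """Return the integer X from the last 'MODEL FAMILIARITY: X/3' line.

    Returns None when the grader output contains no parseable score.
    (Hypothetical helper; not from the paper.)
    """
    matches = _SCORE_RE.findall(grader_output)
    if not matches:
        return None
    # The grader reasons first and scores last, so take the final match.
    return int(matches[-1])


def mean_normalized_score(grader_outputs: Iterable[str]) -> Optional[float]:
    """Average the parseable scores and rescale to the range [0, 1].

    ASSUMPTION: a plain mean over X/3; the paper's aggregation may differ.
    Unparseable outputs are skipped rather than counted as zero.
    """
    scores = [
        score
        for score in map(extract_familiarity_score, grader_outputs)
        if score is not None
    ]
    if not scores:
        return None
    return sum(scores) / (3 * len(scores))
```

Skipping unparseable grader outputs rather than scoring them as zero is a design choice in this sketch; counting them as zero would bias the estimate downward whenever the grader fails to follow the output format.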
