TOFU: A Task of Fictitious Unlearning for LLMs

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter

TL;DR

TOFU introduces a principled benchmark for evaluating unlearning in large language models, using synthetic fictitious authors to enable controlled forgetting. It defines a two-axis evaluation framework: Forget Quality (a KS-test on Truth Ratio distributions) and Model Utility (aggregated across four evaluation datasets using probability, ROUGE, and Truth Ratio). Baseline results show that current unlearning methods struggle to forget without harming performance: simple baselines achieve limited forget quality and reveal knowledge entanglement, which underscores the need for new unlearning approaches and richer evaluation. The paper also discusses limitations, such as the finetuning-only setup and the difficulty of approximating indistinguishability, and outlines future directions for more effective and scalable forgetting in LLMs.

Abstract

Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.
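To make the dataset sizes concrete, the following is a minimal arithmetic sketch in Python. It assumes that forget sets are formed from whole author profiles and that the 1%, 5%, and 10% forget-set sizes mentioned in the Figure 5 caption are percentages of the 200 authors. The function name `split_sizes` and the constants are hypothetical and are not taken from the TOFU code.

```python
# Figures from the abstract: 200 author profiles, 20 question-answer pairs each.
NUM_AUTHORS = 200
QA_PER_AUTHOR = 20


def split_sizes(forget_percent):
    """Return (forget_authors, forget_pairs, retain_pairs).

    Assumption: the forget set consists of whole author profiles, so a
    forget_percent of the authors is removed together with all their pairs.
    """
    forget_authors = NUM_AUTHORS * forget_percent // 100
    forget_pairs = forget_authors * QA_PER_AUTHOR
    total_pairs = NUM_AUTHORS * QA_PER_AUTHOR
    return forget_authors, forget_pairs, total_pairs - forget_pairs


for pct in (1, 5, 10):
    authors, forget_pairs, retain_pairs = split_sizes(pct)
    print(f"{pct}% forget set: {authors} authors, "
          f"{forget_pairs} forget pairs, {retain_pairs} retain pairs")
```

Under these assumptions, even the largest forget set is a small fraction of the 4,000 question-answer pairs in the full finetuning data.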

Paper Structure (39 sections, 5 equations, 33 figures, 4 tables)

Figures (33)

  • Figure 1: TOFU is a well-defined unlearning task that comes with a dataset of fictitious author profiles used for finetuning, a subset of which makes up the forget set.
  • Figure 2: The most frequent words in the final TOFU dataset (left), based on the system prompt described in the paper; and in an initial version of a 50-author dataset based on a simple prompt (right). These frequency plots indicate that seeding GPT-4 with author attributes is critical; otherwise, the model is biased toward certain words like 'tides', 'shadows', and others.
  • Figure 3: Examples of question-answer pairs from all four datasets used in evaluating model utility and forget quality. View the entire dataset at https://huggingface.co/datasets/locuslab/TOFU.
  • Figure 4: Histograms of Truth Ratio values and empirical CDFs from various models and datasets. Left: Llama-2-7B and Phi trained on the $90\%$ retain set and evaluated on the same retain set; Middle: Llama-2-7B trained on the $90\%$ retain set, and evaluated on both the $90\%$ retain set and the $10\%$ forget set; Right: Llama-2-7B trained on the $90\%$ retain set and on the entire finetuning set, both evaluated on the $10\%$ forget set. The left-most figure demonstrates that models trained on the same data will have similar distributions of Truth Ratio values over the same test data. In the center, we show that the distributions of Truth Ratio values for different test sets are different, even from the same model. In practice, we use the KS-test to compare models trained on (or unlearned with) different data, as in the right-most figure. The $p$-values corresponding to these three settings are 0.9003, 1.097e-19, and 2.428e-19, left to right.
  • Figure 5: Forget Quality versus Model Utility for Phi models when unlearning on forget set sizes of $1\%$, $5\%$, and $10\%$ (left to right). The relative size of the markers indicates the epoch of unlearning. Unlearning is challenging and comes with trade-offs. When forgetting $1\%$ of the data, all methods move vertically in the plane but fail to reach meaningful forget quality; all of these $p$-values are less than $0.001$. When forgetting more than $1\%$ of the data, all methods see severe drops in model utility.
  • ...and 28 more figures
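
The Figure 4 caption describes comparing two sets of Truth Ratio values with a KS-test. The following is a minimal pure-Python sketch of a two-sided, two-sample KS-test, not the authors' implementation. The function name `ks_two_sample` is hypothetical, the $p$-value uses a standard asymptotic approximation rather than an exact computation, and the input values are made-up placeholders rather than results from the paper. How Truth Ratio itself is computed is outside the scope of this sketch.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, approximate p-value) for a two-sided two-sample KS-test."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    # D is the largest gap between the two empirical CDFs; it is enough to
    # check the gap at every observed value.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    # Asymptotic Kolmogorov series with a small-sample correction
    # (an approximation, assumed adequate for illustration).
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series converges slowly here, and the p-value is 1 to within ~1e-12.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Toy, made-up Truth Ratio values for illustration only (not from the paper).
unlearned_model = [0.42, 0.55, 0.61, 0.48, 0.70, 0.52, 0.66, 0.58]
retain_only_model = [0.45, 0.53, 0.60, 0.50, 0.68, 0.51, 0.64, 0.59]
d, p = ks_two_sample(unlearned_model, retain_only_model)
print(f"D = {d:.3f}, approximate p = {p:.3f}")
```

A high $p$-value means the test cannot distinguish the two distributions, as in the left panel of Figure 4. A very small $p$-value, as in the middle and right panels, means the distributions clearly differ.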