MUSE: Machine Unlearning Six-Way Evaluation for Language Models

Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, Chiyuan Zhang

TL;DR

MUSE introduces a six-faceted benchmark to evaluate machine unlearning in language models, addressing both data owners' privacy/copyright concerns and deployers' practicality. It defines six criteria, establishes metrics including VerbMem, KnowMem, and PrivLeak, and benchmarks eight unlearning methods across News and Harry Potter datasets. The findings show that while most methods can remove memorization, they often degrade general utility and fail to prevent privacy leakage, with issues in scalability and sustainability under sequential unlearning. The work demonstrates that current unlearning approaches are not yet deployment-ready and provides a public benchmark to drive future improvements. Overall, MUSE highlights critical gaps between theoretical unlearning capabilities and real-world deployment requirements in large language models.

Abstract

Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployers' expectations because they often degrade general model utility and cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models, and we release our benchmark to facilitate further evaluations: muse-bench.github.io

Paper Structure (26 sections, 5 figures, 8 tables)

This paper contains 26 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: MUSE evaluation focuses on six key dimensions of machine unlearning, addressing both data owner and deployer expectations. For example, when an author (data owner) requests the unlearning of the Harry Potter books, they may expect the unlearned model to: (1) avoid generating verbatim copies of the text to protect copyright, (2) eliminate retention of factual knowledge from the books, and (3) not reveal whether the books were previously used in training to protect privacy. A deployer, in turn, may expect unlearning to (4) preserve the model's utility on general tasks, (5) scale effectively to accommodate unlearning of large datasets, and (6) handle sequential unlearning requests that may arrive over time.
  • Figure 3: Distribution of Min-K% Prob, an MIA metric, for $\mathcal{D}_\textrm{forget}$, $\mathcal{D}_\textrm{holdout}$, and $\mathcal{D}_\textrm{retain}$. Consistent with the expected pattern in the paper's MIA illustration (fig:mia), $f_\textrm{retrain}$ shows perfect unlearning, with overlapping distributions for $\mathcal{D}_\textrm{forget}$ and $\mathcal{D}_\textrm{holdout}$. Existing approximate unlearning methods typically either under-unlearn or over-unlearn. For example, $\textsf{GA}_\textsf{KLR}$ shows slight under-unlearning, while $\textsf{GA}_\textsf{GDR}$ over-unlearns, pushing the Min-K% Prob of $\mathcal{D}_\textrm{forget}$ to an extreme level.
  • Figure 4: ROC curves for $\mathcal{D}_\textrm{forget}$ vs. $\mathcal{D}_\textrm{holdout}$ on News using Min-K% Prob, with AUC scores in parentheses. $\textsf{AUC} \approx 0.5$ (i.e., $f_\textrm{retrain}$) means no significant distribution difference between the two sets (i.e., no membership leakage). Most unlearning methods either under-unlearn ($\textsf{AUC} \ll 0.5$) or over-unlearn ($\textsf{AUC} \gg 0.5$).
  • Figure 5: Utility preservation vs. knowledge memorization on BBC. $f_\textrm{retrain}$ maintains high utility on $\mathcal{D}_\textrm{retain}$ while showing low knowledge memorization on $\mathcal{D}_\textrm{forget}$. GA and NPO without regularizers show significant utility loss, collapsing to the origin. Every other unlearning method unlearns the knowledge on $\mathcal{D}_\textrm{forget}$ at the cost of utility.
  • Figure 6: The performance of GA, NPO, and their regularized variants, measured by utility preservation, degrades with larger forget set sizes (a) and sequential unlearning requests (b).
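
The captions for Figures 3 and 4 rely on two quantities: a per-text Min-K% Prob score and an AUC comparing scores on $\mathcal{D}_\textrm{forget}$ against $\mathcal{D}_\textrm{holdout}$. The sketch below is one minimal way to compute both, not the benchmark's implementation. It assumes Min-K% Prob is the mean log-probability of the k% lowest-probability tokens, and it computes AUC as a pairwise rank statistic. The function names, the default `k`, and the toy log-probabilities are all hypothetical. The score orientation (which set counts as positive, and whether a higher score means "member") is also an assumption and may differ from the paper's convention, so only the distance from 0.5 should be read from the result.

```python
import math


def min_k_percent_prob(token_logprobs, k=0.2):
    """Mean log-probability of the k fraction of tokens with the lowest log-probability.

    token_logprobs: per-token log-probabilities of one text under the model.
    k: fraction of tokens to keep (0.2 is an illustrative default, not from the source).
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n least likely tokens
    return sum(lowest) / n


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy per-token log-probabilities (made up for illustration only).
forget_texts = [[-0.2, -0.4, -1.1, -0.3], [-0.1, -0.9, -0.5, -0.2]]
holdout_texts = [[-1.5, -2.2, -0.8, -3.0], [-2.0, -1.2, -2.7, -0.9]]

forget_scores = [min_k_percent_prob(t, k=0.5) for t in forget_texts]
holdout_scores = [min_k_percent_prob(t, k=0.5) for t in holdout_texts]

# Here D_forget is treated as the positive class (an assumption).
leak_auc = auc(forget_scores, holdout_scores)
print(f"AUC (forget vs. holdout): {leak_auc:.2f}")
```

Under this reading, an AUC near 0.5 means the two score distributions are indistinguishable, which is the behavior the Figure 4 caption attributes to $f_\textrm{retrain}$; values far from 0.5 in either direction indicate that membership in $\mathcal{D}_\textrm{forget}$ can be inferred.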