Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models

Lingzhi Wang; Xingshan Zeng; Jinsong Guo; Kam-Fai Wong; Georg Gottlob

TL;DR

The paper tackles the privacy risk posed by neural language models memorizing sensitive training data and proposes SeUL, a selective unlearning framework that targets specific forget spans rather than entire instances. SeUL minimizes a span-level forgetting loss $\mathcal{L}_{UL}$ to suppress the targeted substrings while preserving broader knowledge and generation capabilities, addressing limitations of fully reversed objective methods. To rigorously evaluate forgetting of sensitive content, the authors introduce Sensitive Extraction Likelihood (S-EL) and Sensitive Memorization Accuracy (S-MA), along with online span selection and offline two-stage LLM-based annotation pipelines. Empirical results across GPT-Neo, Llama2, and Mistral models show SeUL effectively forgets sensitive information with minimal disruption to classification performance and improved generation metrics, including robustness under adversarial translation of knowledge. The work offers practical, privacy-preserving unlearning workflows for large language models, combining a concrete objective, targeted evaluation, and scalable annotation strategies.

Abstract

This paper explores Machine Unlearning (MU), an emerging field that is gaining increased attention due to concerns about neural models unintentionally remembering personal or sensitive information. We present SeUL, a novel method that enables selective and fine-grained unlearning for language models. Unlike previous work that employs a fully reversed training objective in unlearning, SeUL minimizes the negative impact on the capability of language models, particularly in terms of generation. Furthermore, we introduce two innovative evaluation metrics, sensitive extraction likelihood (S-EL) and sensitive memorization accuracy (S-MA), specifically designed to assess the effectiveness of forgetting sensitive information. In support of the unlearning framework, we propose efficient automatic online and offline sensitive span annotation methods. The online selection method, based on language probability scores, ensures computational efficiency, while the offline annotation involves a two-stage LLM-based process for robust verification. In summary, this paper contributes a novel selective unlearning method (SeUL), introduces specialized evaluation metrics (S-EL and S-MA) for assessing sensitive information forgetting, and proposes automatic online and offline sensitive span annotation methods to support the overall unlearning framework and evaluation process.
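The span-level idea described above can be illustrated with a small sketch. It is not the authors' implementation: this section does not give the exact form of $\mathcal{L}_{UL}$ or of S-MA, so the code assumes (1) the forgetting loss is the sum of token log-probabilities restricted to the forget spans, so that minimizing it lowers the likelihood of only those tokens, and (2) S-MA is the fraction of span tokens that the model's top-1 prediction reproduces. The function names, the half-open `(start, end)` span convention, and the toy inputs are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Assumption: a span is a half-open [start, end) range of token positions.
Span = Tuple[int, int]


def span_mask(length: int, spans: Sequence[Span]) -> List[bool]:
    """Mark which token positions fall inside any forget span."""
    mask = [False] * length
    for start, end in spans:
        for i in range(max(start, 0), min(end, length)):
            mask[i] = True
    return mask


def span_unlearning_loss(token_log_probs: Sequence[float],
                         spans: Sequence[Span]) -> float:
    """Assumed span-level loss: sum of log p(x_t | x_<t) over span tokens.

    A fully reversed objective would sum over every token; here tokens
    outside the spans contribute nothing, so they are left untouched.
    """
    mask = span_mask(len(token_log_probs), spans)
    return sum(lp for lp, inside in zip(token_log_probs, mask) if inside)


def sensitive_memorization_accuracy(predicted: Sequence[int],
                                    target: Sequence[int],
                                    spans: Sequence[Span]) -> float:
    """Assumed S-MA: share of span tokens whose top-1 prediction matches."""
    mask = span_mask(len(target), spans)
    hits = total = 0
    for p, t, inside in zip(predicted, target, mask):
        if inside:
            total += 1
            hits += int(p == t)
    return hits / total if total else 0.0


# Toy example: four tokens, with positions 1 and 2 marked as sensitive.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.8), math.log(0.25)]
spans = [(1, 3)]
loss = span_unlearning_loss(log_probs, spans)
s_ma = sensitive_memorization_accuracy([5, 7, 9, 2], [5, 7, 1, 2], spans)
```

In a real training loop the log-probabilities would come from the language model and the loss would be minimized by gradient descent; restricting the sum to the spans is what keeps the rest of the sequence out of the reversed objective.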

Paper Structure

This paper contains 35 sections, 7 equations, 4 figures, 5 tables.

Introduction
Related Work
Methodology
Datasets
Evaluation Metrics
Selective Unlearning For LMs
Our Unlearning Methodology
Problem Formulation
Our SeUL Unlearning Method
How to Determine the Forgetting Span?
Online Selection
Offline Annotation
Evaluation of Unlearning
Sensitive Extraction Likelihood
Sensitive Memorization Accuracy
...and 20 more sections

Figures (4)

Figure 1: Illustration of knowledge injection attack.
Figure 2: Workflow of SeUL: Queries with predefined spans (either sensitive or within other definitions) can be inputted directly into SeUL. For queries without predefined spans, we conduct online selection before feeding them to SeUL.
Figure 3: Unlearning GPT-Neo 1.3B: (a) S-MA, S-EL, MA, EL, and (b) Accuracy, F1 Scores over epochs.
Figure 4: Unlearning GPT-Neo 125M: (a) Accuracy and (b) F1 Scores with varying $d$ values.
