Forgetting-MarI: LLM Unlearning via Marginal Information Regularization

Shizhou Xu, Yuan Ni, Stefan Broecker, Thomas Strohmer

TL;DR

The paper tackles unlearning in large language models by introducing Forgetting-MarI, an information-theoretic framework that penalizes marginal information to remove only the unlearned data's unique contributions while preserving retained knowledge. It formalizes marginal information via mutual information between a Marginal Information (MarI) signal and a binary indicator, and proposes an MI-based regularizer with explicit bounds on residual influence to ensure provable undetectability. The approach offers token-wise and pooled MarI estimators, enabling continual unlearning with stable utility preservation. Empirical results on mid-scale models (GPT-2 Large and Llama-3.2-1B) across copyright-like and domain datasets show Forgetting-MarI outperforms full-information baselines in forgetting efficacy and utility maintenance, with detector analyses supporting the theoretical guarantees.
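
As a rough illustration of the quantity named above (mutual information between a signal and a binary indicator), the sketch below computes $I(X;Z)$ exactly for a discrete $X$ whose conditional distributions given $Z=1$ and $Z=0$ are known. It relies on the standard identity $I(X;Z) = \pi\,\mathrm{KL}(p_1 \| m) + (1-\pi)\,\mathrm{KL}(p_0 \| m)$ with mixture $m = \pi p_1 + (1-\pi) p_0$. The function names, the toy distributions, and the reading of $p_1$ and $p_0$ as output distributions on unlearn-like and retain-like inputs are assumptions made for illustration; this is not the paper's token-wise or pooled estimator.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_information_with_indicator(
    p_one: Sequence[float], p_zero: Sequence[float], prior: float
) -> float:
    """I(X; Z) in nats, where X | Z=1 ~ p_one, X | Z=0 ~ p_zero, and P[Z=1] = prior.

    Hypothetical helper: the mixture identity is standard, but the naming and
    the use of plain probability vectors are assumptions for this sketch.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Marginal distribution of X under the mixture.
    mixture = [prior * a + (1.0 - prior) * b for a, b in zip(p_one, p_zero)]
    return prior * kl_divergence(p_one, mixture) + (1.0 - prior) * kl_divergence(
        p_zero, mixture
    )


# Toy next-token distributions over a 3-word vocabulary (made-up numbers).
p_unlearn = [0.7, 0.2, 0.1]
p_retain = [0.5, 0.3, 0.2]
marginal_info = mutual_information_with_indicator(p_unlearn, p_retain, prior=0.5)
```

When the two conditionals coincide the value is zero, and when their supports are disjoint with a balanced prior it reaches $\ln 2$ nats, the largest value a binary indicator allows.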

Abstract

As AI models are trained on ever-expanding datasets, the ability to remove the influence of specific data from trained models has become essential for privacy protection and regulatory compliance. Unlearning addresses this challenge by selectively removing parametric knowledge from the trained models without retraining from scratch, which is critical for resource-intensive models such as Large Language Models (LLMs). Existing unlearning methods often degrade model performance by removing more information than necessary when attempting to "forget" specific data. We introduce Forgetting-MarI, an LLM unlearning framework that provably removes only the additional (marginal) information contributed by the data to be unlearned, while preserving the information supported by the data to be retained. By penalizing marginal information, our method yields an explicit upper bound on the unlearn dataset's residual influence in the trained models, providing provable undetectability. Extensive experiments confirm that our approach outperforms current state-of-the-art unlearning methods, delivering reliable forgetting and better preserved general model performance across diverse benchmarks. This advancement represents an important step toward making AI systems more controllable and compliant with privacy and copyright regulations without compromising their effectiveness.

Paper Structure

This paper contains 38 sections, 9 theorems, 48 equations, 16 figures, 6 tables, 1 algorithm.

Key Result

Proposition 2.1

For $(X_{\mathrm{MarI}}, Z)$ with prior $\pi = \mathbb{P}[Z=1]$, where $H_2(\cdot)$ is the binary entropy and $H_2^{-1}$ denotes the inverse of $H_2$ restricted to $[0,\tfrac12]$.
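
The displayed inequality of the proposition is not reproduced here. As a sketch of how a bound built from these ingredients behaves, the code below evaluates the standard Fano inequality for a binary label: any detector of $Z$ from $X$ has accuracy at most $1 - H_2^{-1}\big(H_2(\pi) - I(X;Z)\big)$. Treating this as the form of Proposition 2.1 is an assumption, as are the function names and the choice of nats as the unit.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H_2(p) in nats, with H_2(0) = H_2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def binary_entropy_inverse(h: float, tol: float = 1e-12) -> float:
    """Inverse of H_2 restricted to [0, 1/2], computed by bisection.

    H_2 is increasing on [0, 1/2], so bisection on that interval suffices.
    """
    h = min(max(h, 0.0), math.log(2.0))  # clamp to the range of H_2
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fano_accuracy_bound(mutual_info: float, prior: float) -> float:
    """Upper bound on detection accuracy from binary Fano (assumed form).

    mutual_info is I(X; Z) in nats and prior is P[Z = 1].
    """
    conditional_entropy = binary_entropy(prior) - mutual_info  # H(Z | X)
    return 1.0 - binary_entropy_inverse(conditional_entropy)


# With zero mutual information the bound equals max(prior, 1 - prior),
# i.e. the accuracy of always guessing the more likely label.
chance_level = fano_accuracy_bound(0.0, prior=0.5)
small_leak = fano_accuracy_bound(0.01, prior=0.5)
```

Under this assumed form, driving the mutual information toward zero pushes the best achievable detection accuracy toward chance level, which is the sense in which an MI penalty can support an undetectability claim.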

Figures (16)

  • Figure 1: Comparison of sentence completions generated by Llama-3.2-1B models before and after different unlearning methods.
  • Figure 1: Forgetting-MarI.
  • Figure 2: Comparison of families of unlearning methods based on literature evidence. Our proposed marginal effect unlearning addresses key limitations of existing approaches. (✓=yes, ✗=no, ☆=partial)
  • Figure 2: Pseudo-code for Forgetting-MarI.
  • Figure 3: Left panels summarize next-token accuracy on the retain/unlearn/validation sets, whereas right panels summarize general capability on various benchmarks. The top row shows results from Llama-3.2-1B on Careless People (correlated split); the bottom row shows results from GPT-2 Large on Harry Potter. Each method is reported at its best $\lambda$ and training epoch. On the left, an ideal method should match the unlearn baseline on retain/unlearn/validation accuracy. On the right, better methods should achieve higher accuracy on ARC-E, HellaSwag, PIQA, and MMLU and lower perplexity on WikiText. A star marks the best performer on each test.
  • ...and 11 more figures

Theorems & Definitions (12)

  • Definition 1.1: Marginal Information (MarI)
  • Proposition 2.1: Detection accuracy upper bounded by mutual information
  • Definition 2.1: MI-based marginal information loss
  • Remark 2.1: Alternative quantification
  • Theorem 2.1: MarI controls the self-perplexity gap
  • Theorem 3.1: Word-level provable unlearning via pooled MarI
  • Theorem 2.1: MarI controls neighborhood-perplexity gap
  • Lemma 2.1: Point-wise KL bound
  • Lemma 2.2
  • Lemma 2.3: Exact TV scaling under mixture
  • ...and 2 more