Machine Unlearning via Information Theoretic Regularization

Shizhou Xu, Thomas Strohmer

TL;DR

The paper introduces an auditable, information-theoretic approach to machine unlearning that treats feature unlearning and marginal data-point unlearning within a unified rate–distortion framework. By regularizing the learning outcome with mutual information terms I(S';Z), it systematically reduces leakage of unwanted information while preserving task utility, and it provides theoretical guarantees linking information measures to post-hoc auditability and to anchor-based guarantees. A key analytic result shows that, under certain utilities, the optimal unlearning outcome is the Wasserstein-2 barycenter of the conditional data distributions, enabling a principled and efficient solution via optimal transport. The approach yields practical algorithms and validates them through numerical experiments on tabular and image data, demonstrating robust performance on both feature and data unlearning tasks. Overall, the work offers a principled, testable path away from unreliable retraining-on-retain guarantees toward neuroscience-inspired, information-theoretic marginal unlearning with auditability and applicability to diverse AI systems.
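To make the barycenter result concrete, here is a minimal sketch in one dimension, where the Wasserstein-2 barycenter of several distributions is obtained by averaging their quantile functions. It is an illustration under simplifying assumptions (scalar data, equal sample sizes per group), not the paper's algorithm. The function name `w2_barycenter_1d` and the toy Gaussian groups are hypothetical.

```python
import numpy as np

def w2_barycenter_1d(groups, weights=None):
    """Empirical W2 barycenter of 1-D sample sets of equal size.

    In one dimension the barycenter's quantile function is the weighted
    average of the groups' quantile functions, so sorting each group and
    averaging the order statistics yields samples from the barycenter.
    """
    # Row i holds the sorted samples (empirical quantiles) of group i.
    sorted_groups = np.stack([np.sort(np.asarray(g, dtype=float)) for g in groups])
    k = sorted_groups.shape[0]
    if weights is None:
        weights = np.full(k, 1.0 / k)  # uniform weights by default
    weights = np.asarray(weights, dtype=float)
    # Weighted average of quantile functions, one value per rank.
    return weights @ sorted_groups

# Toy conditional distributions X | Z=0 and X | Z=1 (assumed for illustration).
rng = np.random.default_rng(0)
x_z0 = rng.normal(loc=-1.0, scale=1.0, size=2000)
x_z1 = rng.normal(loc=3.0, scale=1.0, size=2000)

bary = w2_barycenter_1d([x_z0, x_z1])
print("barycenter mean:", bary.mean(), "std:", bary.std())
```

In this toy setting, each group's samples can be transported to the common barycenter by rank, so the transported data no longer distinguishes the two groups. Higher-dimensional data needs a general optimal transport solver instead of sorting.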

Abstract

How can we effectively remove or "unlearn" undesirable information, such as specific features or the influence of individual data points, from a learning outcome while minimizing utility loss and ensuring rigorous guarantees? We introduce a unified mathematical framework based on information-theoretic regularization to address both data point unlearning and feature unlearning. For data point unlearning, we introduce the $\textit{Marginal Unlearning Principle}$, an auditable and provable framework inspired by memory suppression studies in neuroscience. Moreover, we provide a formal information-theoretic unlearning definition based on the proposed principle, named marginal unlearning, and provable guarantees on the sufficiency and necessity of marginal unlearning with respect to the existing approximate unlearning definitions. We then show that the proposed framework provides a natural solution to marginal unlearning problems. For feature unlearning, the framework applies to deep learning with arbitrary training objectives. By combining flexibility in learning objectives with simplicity in regularization design, our approach is highly adaptable and practical for a wide range of machine learning and AI applications. From a mathematical perspective, we provide a unified analytic solution to the optimal feature unlearning problem with a variety of information-theoretic training objectives. Our theoretical analysis reveals intriguing connections between machine unlearning, information theory, optimal transport, and extremal sigma algebras. Numerical simulations support our theoretical findings.

Paper Structure

This paper contains 83 sections, 11 theorems, 95 equations, 4 figures, 3 tables, 3 algorithms.

Key Result

Lemma 2.1

Let $\Theta$ be an $\mathcal{H}$-valued random variable independent of $(X_{\textrm{margin}}, Z)$. Define the marginal output $\hat{Y}_{\mathrm{margin}} := f_\Theta(X_{\textrm{margin}})$, and the model output distribution tested on $p^d$ and $p^r$ respectively by Then,

Figures (4)

  • Figure 1: Feature–unlearning frontiers: Each row reports utility versus feature influence for a dataset: the left panel shows accuracy (↑) and the right panel shows AUROC (↑) against the demographic-parity gap (DP-gap, ↓), defined as $|\,\mathbb{P}[\hat{Y}{=}1\mid Z{=}1]-\mathbb{P}[\hat{Y}{=}1\mid Z{=}0]\,|$. Curves trace each method’s trade-off as its trade-off parameter varies. Points denote the mean and bands denote $\pm$1 s.d. over $5$ folds. Lower DP-gap at comparable or higher utility indicates a better frontier.
  • Figure 2: Evolution of output densities $p(f_\theta(X_r))$ (left column) and $p(f_\theta(X_u))$ (right column) over epochs for the three objectives. Marginal unlearning suppresses the unlearn signal (concentration around zero) while preserving the uniform density supported by the retain signal. In comparison, the methods based on gradient ascent all push the mass concentration elsewhere, even though they use different utility-preservation terms to counter the destructive gradient ascent.
  • Figure 3: MNIST unlearning trajectories on the train (left panel) and test (right panel) folders, with lines representing the mean and shaded band widths representing 1 standard deviation.
  • Figure 4: Unlearning outcomes for both tasks. The top panel shows the unlearning results for the smile feature and the bottom panel shows the results for the gender feature. In each panel, the first row, denoted by $X$, shows samples from the original data set with the chosen feature value (smile and female, respectively). The last row, denoted by $T(X)$, shows the push-forward image of the corresponding sample in $X$ under the generated optimal transport map $T$. The middle row, denoted by $\text{Bary}(X) := [0.5\,\mathrm{Id} + 0.5\,T](X)$, shows the corresponding sample generated by the McCann interpolation at $t = 0.5$, which coincides with the barycenter in our two-marginal case.
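
The demographic-parity gap defined in the Figure 1 caption can be computed directly from hard predictions and a binary feature. The following is a minimal sketch. The helper name `dp_gap` and the toy arrays are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def dp_gap(y_pred, z):
    """Return |P[Yhat=1 | Z=1] - P[Yhat=1 | Z=0]| for binary arrays."""
    y_pred = np.asarray(y_pred)
    z = np.asarray(z)
    # Both groups must be present, otherwise a conditional rate is undefined.
    if not ((z == 1).any() and (z == 0).any()):
        raise ValueError("both groups Z=0 and Z=1 must be present")
    rate_z1 = (y_pred[z == 1] == 1).mean()  # positive rate in group Z=1
    rate_z0 = (y_pred[z == 0] == 1).mean()  # positive rate in group Z=0
    return float(abs(rate_z1 - rate_z0))

# Toy example: positive rate is 3/4 for Z=1 and 1/4 for Z=0.
y_hat = np.array([1, 1, 0, 1, 0, 0, 1, 0])
z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
gap = dp_gap(y_hat, z)
print("DP-gap:", gap)
```

A gap of 0 means the predicted positive rate is the same in both groups. Larger values indicate that the prediction still depends on the feature $Z$.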

Theorems & Definitions (20)

  • Definition 2.1: Feature unlearning (independence)
  • Definition 2.2: Mutual-information feature unlearning with utility budget
  • Remark 2.1: Human analogy: direct vs. marginal unlearning
  • Definition 2.3: ${\varepsilon}$-marginal unlearning
  • Definition 2.4: Mutual-information marginal unlearning with utility budget
  • Lemma 2.1: MI controls output drift
  • Theorem 2.1: Marginal unlearning $+$ utility $\Rightarrow$ approximate unlearning
  • Lemma 2.2: Marginal unlearning as a condition for "good" retraining
  • Proposition 2.1: Marginal unlearning reduces utility on the unlearned record
  • Definition 3.1: Pareto optimal feature unlearning
  • ...and 10 more