Machine Unlearning via Information Theoretic Regularization
Shizhou Xu, Thomas Strohmer
TL;DR
The paper introduces an auditable, information-theoretic approach to machine unlearning that treats feature unlearning and marginal data-point unlearning within a unified rate–distortion framework. By regularizing the learning outcome with mutual information terms I(S';Z), it systematically reduces leakage of unwanted information while preserving task utility, and it provides theoretical guarantees linking information measures to post-hoc auditability and to anchor-based guarantees. A key analytic result shows that, under certain utilities, the optimal unlearning outcome is the Wasserstein-2 barycenter of the conditional data distributions, enabling a principled and efficient solution via optimal transport. The approach yields practical algorithms and validates them through numerical experiments on tabular and image data, demonstrating robust performance on both feature and data unlearning tasks. Overall, the work offers a principled, testable path away from unreliable retraining-on-retain guarantees toward neuroscience-inspired, information-theoretic marginal unlearning with auditability and applicability to diverse AI systems.
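To make the barycenter statement concrete, here is a minimal sketch of the one-dimensional case, where the Wasserstein-2 barycenter has a well-known closed form: its quantile function is the weighted average of the quantile functions of the input distributions. This is an illustration under stated assumptions, not the authors' algorithm. The function names `w2_barycenter_1d` and `push_to_barycenter`, the grouping of samples by a conditioning variable, and the quantile grid size are all hypothetical choices made for this example.

```python
import numpy as np


def w2_barycenter_1d(samples_by_group, weights=None, n_quantiles=200):
    """Approximate the 1D Wasserstein-2 barycenter of several empirical
    distributions, returned as a quantile function on a grid of levels.

    Assumption: each entry of `samples_by_group` holds scalar samples from
    one conditional distribution (one group of the conditioning variable).
    """
    groups = [np.asarray(s, dtype=float) for s in samples_by_group]
    if weights is None:
        weights = np.full(len(groups), 1.0 / len(groups))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize to a probability vector

    # Midpoint quantile levels in (0, 1); avoids the extreme order statistics.
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles

    # In 1D the barycenter's quantile function is the weighted average
    # of the groups' quantile functions.
    group_quantiles = np.stack([np.quantile(g, levels) for g in groups])
    return levels, weights @ group_quantiles


def push_to_barycenter(x, group_samples, levels, bary_quantiles):
    """Map values `x` from one group onto the barycenter using the monotone
    map T = Q_bar(F_group(x)), with F_group the group's empirical CDF."""
    sorted_group = np.sort(np.asarray(group_samples, dtype=float))
    u = np.searchsorted(sorted_group, x, side="right") / len(sorted_group)
    u = np.clip(u, levels[0], levels[-1])  # stay inside the quantile grid
    return np.interp(u, levels, bary_quantiles)
```

After mapping every group through `push_to_barycenter`, the transformed samples share approximately the same distribution regardless of group, which is the sense in which group information is removed in this toy setting. In higher dimensions no such closed form exists in general, and an optimal transport solver would be needed instead.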
Abstract
How can we effectively remove or "unlearn" undesirable information, such as specific features or the influence of individual data points, from a learning outcome while minimizing utility loss and ensuring rigorous guarantees? We introduce a unified mathematical framework based on information-theoretic regularization to address both data point unlearning and feature unlearning. For data point unlearning, we introduce the $\textit{Marginal Unlearning Principle}$, an auditable and provable framework inspired by memory suppression studies in neuroscience. Moreover, we provide a formal information-theoretic definition of unlearning based on the proposed principle, named marginal unlearning, together with provable guarantees on the sufficiency and necessity of marginal unlearning relative to existing approximate unlearning definitions. We then show that the proposed framework provides a natural solution to marginal unlearning problems. For feature unlearning, the framework applies to deep learning with arbitrary training objectives. By combining flexibility in learning objectives with simplicity in regularization design, our approach is highly adaptable and practical for a wide range of machine learning and AI applications. From a mathematical perspective, we provide a unified analytic solution to the optimal feature unlearning problem with a variety of information-theoretic training objectives. Our theoretical analysis reveals intriguing connections between machine unlearning, information theory, optimal transport, and extremal sigma algebras. Numerical simulations support our theoretical findings.
