Learn What You Want to Unlearn: Unlearning Inversion Attacks against Machine Unlearning

Hongsheng Hu; Shuo Wang; Tian Dong; Minhui Xue

TL;DR

This work reveals a previously underexplored privacy risk in machine unlearning: unlearning inversion attacks, which exploit the differences between the original model $\boldsymbol{\theta}$ and the unlearned model $\boldsymbol{\theta}_u$ under a Machine Learning as a Service (MLaaS) threat model. It formalizes two attack modes. Feature recovery applies gradient inversion to the parameter difference $\nabla^* = \boldsymbol{\theta} - \boldsymbol{\theta}_u$, using gradient-based optimization of a dummy input with a total variation (TV) prior. Label inference queries both models with probing samples and analyzes the differences between their outputs, and includes a ZOO-based black-box probing strategy. In experiments on CIFAR-10, CIFAR-100, and STL-10, plus medical and financial datasets, the authors demonstrate effective feature recovery under approximate unlearning and reliable label inference across multiple settings, while exact unlearning shows leakage primarily in small-data regimes. The paper also proposes three defenses (parameter obfuscation, model pruning, and fine-tuning) that mitigate leakage but significantly reduce the utility of the unlearned model, underscoring the need for safer unlearning designs. Overall, the work provides a way to measure privacy risk in unlearning and motivates future unlearning methods with provable privacy guarantees that do not sacrifice practical performance.
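To make the feature-recovery idea concrete, the sketch below shows, in a toy case, why the parameter difference alone can expose an unlearned sample. It is not the paper's optimization-based attack (dummy input plus TV prior). It swaps in the closed-form special case of a single linear softmax layer with a bias term, where each row of the weight gradient equals the input scaled by the matching bias gradient. The one-step gradient-ascent unlearning rule, the learning rate, and all function names are assumptions made for illustration only.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def grad_linear_softmax(W, b, x, y):
    """Cross-entropy gradient for logits z = W x + b on one sample (x, y)."""
    z = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    p = softmax(z)
    delta = [pi - (1.0 if i == y else 0.0) for i, pi in enumerate(p)]  # dL/dz
    gW = [[d * xj for xj in x] for d in delta]  # dL/dW = delta * x^T
    gb = delta                                  # dL/db = delta
    return gW, gb


def unlearn_one_step(W, b, x, y, lr=0.1):
    """ASSUMPTION: approximate unlearning as one gradient-ascent step on (x, y)."""
    gW, gb = grad_linear_softmax(W, b, x, y)
    Wu = [[w + lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
    bu = [bi + lr * g for bi, g in zip(b, gb)]
    return Wu, bu


def recover_features(W, b, Wu, bu):
    """Recover x from theta - theta_u alone: row_i(dW) / db_i = x."""
    dW = [[w - wu for w, wu in zip(rw, rwu)] for rw, rwu in zip(W, Wu)]
    db = [bi - bui for bi, bui in zip(b, bu)]
    i = max(range(len(db)), key=lambda k: abs(db[k]))  # best-conditioned row
    return [v / db[i] for v in dW[i]]


# Toy setup: random linear model and a random "forgotten" sample.
random.seed(0)
n_classes, n_features = 3, 5
W = [[random.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_classes)]
b = [random.uniform(-1, 1) for _ in range(n_classes)]
x_forgotten = [random.random() for _ in range(n_features)]
y_forgotten = 1

Wu, bu = unlearn_one_step(W, b, x_forgotten, y_forgotten)
x_recovered = recover_features(W, b, Wu, bu)
max_err = max(abs(a - c) for a, c in zip(x_forgotten, x_recovered))
print("max reconstruction error:", max_err)
```

Because the recovery divides one row of the weight difference by the matching bias difference, both the step size and the sign convention of the difference cancel out. Deeper networks generally do not admit this shortcut, which is where iterative gradient-matching optimization comes in.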

Abstract

Machine unlearning has become a promising solution for fulfilling the "right to be forgotten", under which individuals can request the deletion of their data from machine learning models. However, existing studies of machine unlearning mainly focus on the efficacy and efficiency of unlearning methods, while neglecting privacy vulnerabilities that arise during the unlearning process. With two versions of a model available to an adversary, that is, the original model and the unlearned model, machine unlearning opens up a new attack surface. In this paper, we conduct the first investigation to understand the extent to which machine unlearning can leak the confidential content of the unlearned data. Specifically, under the Machine Learning as a Service setting, we propose unlearning inversion attacks that can reveal the feature and label information of an unlearned sample by accessing only the original and unlearned models. The effectiveness of the proposed unlearning inversion attacks is evaluated through extensive experiments on benchmark datasets across various model architectures and on representative exact and approximate unlearning approaches. The experimental results indicate that the proposed attacks can reveal sensitive information about the unlearned data. We therefore identify three possible defenses that help to mitigate the proposed attacks, albeit at the cost of reducing the utility of the unlearned model. This study uncovers an underexplored gap between machine unlearning and the privacy of unlearned data, highlighting the need for carefully designed mechanisms that implement unlearning without leaking information about the unlearned data.
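The label-inference side of the attack can be illustrated with a minimal sketch: query both models with the same probing samples, measure the per-class change in prediction confidence, and pick the class whose confidence drops the most. The averaging rule over probes, the function name `infer_unlearned_label`, and the confidence values are hypothetical stand-ins for black-box query responses, not results from the paper.

```python
def infer_unlearned_label(original_predict, unlearned_predict, probes):
    """Return (inferred_label, mean_change) from black-box confidence outputs.

    ASSUMPTION: per-class changes are averaged over all probes; the paper's
    exact aggregation rule may differ.
    """
    n_classes = len(original_predict(probes[0]))
    mean_change = [0.0] * n_classes
    for probe in probes:
        before = original_predict(probe)   # confidences from the original model
        after = unlearned_predict(probe)   # confidences from the unlearned model
        for c in range(n_classes):
            mean_change[c] += (after[c] - before[c]) / len(probes)
    # The class with the most negative mean change is the inferred label.
    inferred = min(range(n_classes), key=lambda c: mean_change[c])
    return inferred, mean_change


# Made-up query responses for three probing samples and three classes.
ORIGINAL = {
    "probe_a": [0.70, 0.20, 0.10],
    "probe_b": [0.10, 0.80, 0.10],
    "probe_c": [0.20, 0.30, 0.50],
}
UNLEARNED = {
    "probe_a": [0.72, 0.15, 0.13],
    "probe_b": [0.16, 0.68, 0.16],
    "probe_c": [0.22, 0.24, 0.54],
}

label, change = infer_unlearned_label(ORIGINAL.get, UNLEARNED.get, list(ORIGINAL))
print("inferred label:", label)
print("mean confidence change per class:", [round(v, 4) for v in change])
```

In this toy data only class 1 loses confidence on every probe, so it is selected. The sketch needs prediction outputs only, which matches the user-side view in the MLaaS setting.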

Paper Structure (29 sections, 11 equations, 20 figures, 4 tables)

Introduction
Related Work and Threat Model
Related Work
Our Threat Model
Methodology
Problem Statement
Unlearning Inversion Attacks
Unlearning Inversion Attacks for Feature Recovery
Unlearning Inversion Attacks for Label Inference
Experimental Settings
Datasets and Models
Machine Unlearning Settings
Evaluation
Effectiveness of Feature Recovery
Effectiveness of Label Inference
...and 14 more sections

Figures (20)

Figure 1: An overview of the unlearning inversion attack in machine unlearning. Given access to the original model and the unlearned model, an attacker can mount unlearning inversion attacks to reveal the information of unlearned data.
Figure 2: An overview of unlearning inversion attacks in the MLaaS environment. The server leverages the differences between the two models' parameters to recover the unlearned sample's features. The user leverages the differences between the two models' prediction outputs to infer the unlearned sample's label.
Figure 3: The recovered data from the exact and approximate unlearning on CIFAR-10, CIFAR-100, and STL-10.
Figure 4: Changes in the prediction confidence of probing samples under exact and approximate unlearning. The class whose bar extends furthest below 0 is the inferred label of the unlearned data.
Figure 5: Feature recovery when the training dataset size is 8. The image IDs along the middle column are used to mark specific image positions for the presentation of Table \ref{tab:metric_stat}.
...and 15 more figures
