On the Necessity of Output Distribution Reweighting for Effective Class Unlearning
Ali Ebrahimpour-Boroojeny, Yian Wang, Hari Sundaram
TL;DR
The paper addresses privacy leakage in class unlearning by revealing that neglecting the geometry of remaining classes enables leakage under strong attacks. It introduces MIA-NN, a nearest-neighbor-based membership inference attack, and Tilted ReWeighting (TRW), a lightweight fine-tuning objective that redistributes forgotten-class probability mass using inter-class similarities and a maximum-entropy tilt. TRW more accurately mirrors the behavior of models retrained from scratch on the retained data and remains robust against both standard MIAs and the proposed MIA-NN, achieving near-retraining performance with modest computational cost. Empirical results across MNIST, CIFAR-10/100, and Tiny-ImageNet show that TRW often matches or surpasses state-of-the-art unlearning methods on traditional metrics while offering stronger privacy guarantees, including under the stronger U-LiRA evaluation framework.
Abstract
In this paper, we reveal a significant shortcoming in class unlearning evaluations: overlooking the underlying class geometry can cause privacy leakage. We further propose a simple yet effective solution to mitigate this issue. We introduce a membership-inference attack via nearest neighbors (MIA-NN) that uses the probabilities the model assigns to neighboring classes to detect unlearned samples. Our experiments show that existing unlearning methods are vulnerable to MIA-NN across multiple datasets. We then propose a new fine-tuning objective that mitigates this privacy leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the target distribution during fine-tuning. We also show that, across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior unlearning metrics. More specifically, on CIFAR-10, it reduces the gap with retrained models by 19% and 46% for U-LiRA and MIA-NN scores, respectively, compared to the state-of-the-art (SOTA) method for each category.
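The abstract does not state the TRW formula, so the following is only an illustrative sketch of the general idea: drop the forget class from the model's output distribution, then reweight the remaining classes by an exponential tilt based on their similarity to the forget class. The function name `tilted_reweighting`, the `similarity` vector, and the temperature-like parameter `beta` are hypothetical and are not taken from the paper; the exponential form is assumed here because it is the standard maximum-entropy tilt.

```python
import numpy as np

def tilted_reweighting(probs, forget_class, similarity, beta=1.0):
    """Build a target distribution over the remaining classes (illustrative only).

    probs:        (n, C) array of softmax outputs for forget-class inputs.
    forget_class: index of the class being unlearned.
    similarity:   (C,) array; assumed similarity of each class to the forget class.
    beta:         assumed tilt strength; beta = 0 reduces to plain renormalization.
    """
    probs = np.asarray(probs, dtype=float)
    similarity = np.asarray(similarity, dtype=float)

    # Mask selecting the classes that remain after unlearning.
    keep = np.ones(probs.shape[1], dtype=bool)
    keep[forget_class] = False

    # Exponential tilt; subtracting the max over kept classes is for numerical stability.
    tilt = np.exp(beta * (similarity - similarity[keep].max()))

    # Remove the forget class, apply the tilt, and renormalize each row.
    weighted = probs * tilt
    weighted[:, forget_class] = 0.0
    return weighted / weighted.sum(axis=1, keepdims=True)
```

With `beta = 0` the sketch simply rescales the remaining probabilities, while larger values of `beta` shift more mass toward classes assumed to be similar to the forgotten one. In a fine-tuning loop, such a target could be compared against the model's output on forget-class inputs with a divergence loss, though the exact objective used in the paper is not given in this section.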
