Multi-Class Unlearning for Image Classification via Weight Filtering
Samuele Poppi, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
TL;DR
This paper tackles the problem of removing all knowledge of chosen classes from a pre-trained image classifier, with a single unlearning round that covers every class. It introduces WF-Net, a Weight-Filtering framework that wraps the network's inner components with learnable memory matrices ($\alpha$). After one round of training, the network can perform class-specific, orthogonal unlearning for any class, with both CNNs and Vision Transformers. The method preserves accuracy on the remaining classes while driving accuracy on the forgotten classes close to zero. It is also explainable: the learned $\alpha$ matrices reveal which network components are associated with each class. Experiments on MNIST, CIFAR-10, and ImageNet-1k show effective forgetting, metrics competitive with retraining, measurable gains in the Zero Retrain Forgetting (ZRF) score, and informative insertion/deletion explainability scores. Because all $N_c$ classes are handled in one pass and the class to forget can be selected at runtime, the approach reduces retraining cost and advances practical privacy-preserving unlearning.
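The idea of wrapping a layer with a memory matrix and choosing the forgotten class at runtime can be illustrated with a toy linear layer. This is a minimal sketch under stated assumptions, not WF-Net's implementation: we assume $\alpha$ has one row per class ($N_c$ rows) and one value in $[0, 1]$ per output component, and that forgetting class $c$ scales each output component's weights and bias by $\alpha[c][k]$. The class name `FilteredLinear` and its parameters are hypothetical, and the weights and $\alpha$ values are made up for illustration.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


class FilteredLinear:
    """Hypothetical weight-filtered linear layer (illustrative only).

    weight: K x D matrix (K output components, D inputs).
    bias:   K values, one per output component.
    alpha:  N_c x K memory matrix; row c holds one filter value per
            output component, applied when class c is to be forgotten.
    """

    def __init__(self, weight: Matrix, bias: Vector, alpha: Matrix) -> None:
        self.weight = weight
        self.bias = bias
        self.alpha = alpha

    def filtered_params(self, forget_class: Optional[int]) -> Tuple[Matrix, Vector]:
        # No class selected: behave like the original, unfiltered layer.
        if forget_class is None:
            return [row[:] for row in self.weight], self.bias[:]
        gates = self.alpha[forget_class]
        # Scale the weight row and the bias of output component k by alpha[c][k].
        w = [[g * wi for wi in row] for g, row in zip(gates, self.weight)]
        b = [g * bi for g, bi in zip(gates, self.bias)]
        return w, b

    def __call__(self, x: Vector, forget_class: Optional[int] = None) -> Vector:
        w, b = self.filtered_params(forget_class)
        # Plain matrix-vector product plus bias.
        return [sum(wi * xi for wi, xi in zip(row, x)) + bk
                for row, bk in zip(w, b)]


# Toy example: 3 output components, 2 classes (all values are illustrative).
layer = FilteredLinear(
    weight=[[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]],
    bias=[0.0, 0.0, 0.0],
    alpha=[[0.0, 1.0, 1.0],   # forgetting class 0 switches off component 0
           [1.0, 0.0, 1.0]],  # forgetting class 1 switches off component 1
)
x = [1.0, 1.0]
print(layer(x))                  # original behaviour
print(layer(x, forget_class=0))  # component 0 filtered out
print(layer(x, forget_class=1))  # component 1 filtered out
```

The design point the sketch tries to capture is that the base weights are shared and untouched; only the selected row of $\alpha$ changes, so switching the forgotten class at inference time requires no retraining.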
Abstract
Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning.
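One way to read a class representation out of the learned memory matrices is to rank, for each class, the components that are filtered most strongly when that class is forgotten. The sketch below assumes, hypothetically, that a low $\alpha[c][k]$ means component $k$ matters for class $c$. It also uses Jaccard overlap between the selected component sets as a rough proxy for how class-specific (orthogonal) they are. The function names, the `top_k` parameter, and the $\alpha$ values are illustrative assumptions, not the paper's procedure or data.

```python
from typing import Dict, List


def class_specific_components(alpha: List[List[float]], top_k: int) -> Dict[int, List[int]]:
    """For each class c, return the top_k component indices with the smallest alpha[c][k].

    Assumption (hypothetical): strongly filtered components are the ones
    the network relies on for that class.
    """
    result: Dict[int, List[int]] = {}
    for c, row in enumerate(alpha):
        # Sort component indices by filter value; ties broken by index.
        ranked = sorted(range(len(row)), key=lambda k: (row[k], k))
        result[c] = ranked[:top_k]
    return result


def overlap(a: List[int], b: List[int]) -> float:
    """Jaccard overlap between two component sets (0.0 means disjoint)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Illustrative alpha for 3 classes and 4 components (made-up values).
alpha = [
    [0.05, 0.90, 0.95, 0.10],  # class 0
    [0.92, 0.02, 0.88, 0.97],  # class 1
    [0.99, 0.93, 0.01, 0.15],  # class 2
]
comps = class_specific_components(alpha, top_k=2)
print(comps)
print(overlap(comps[0], comps[1]))  # classes 0 and 1 share no components
print(overlap(comps[0], comps[2]))  # classes 0 and 2 share component 3
```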
