Don't Forget Too Much: Towards Machine Unlearning on Feature Level

Heng Xu, Tianqing Zhu, Wanlei Zhou, Wei Zhao

TL;DR

The paper addresses the need for feature-level unlearning in deep models, proposing two schemes: (i) feature unlearning with known feature annotations using adversarial training to remove targeted features while preserving task-relevant information, and (ii) feature unlearning without annotations using interpretability-guided encoding and eigengap-driven feature identification followed by fine-tuning. It formalizes the problem with precise notations, introduces remover, identifier, and adversary components, and defines loss functions that balance information retention and feature removal. The authors provide extensive experiments on image data, including qualitative gradient visualizations and quantitative metrics, and show that their methods can effectively remove targeted feature information with minimal impact on main-task performance, while also enabling debiasing applications. The work offers practical, faster-than-retraining approaches to unlearning and lays groundwork for extending to other feature types and modalities, with future directions toward NLP and GAN contexts.
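The summary above does not say how the eigengap is computed or applied. As a hedged illustration only, the sketch below shows the standard eigengap heuristic from spectral clustering: sort the eigenvalues of a graph Laplacian in ascending order and pick the count k at which the jump to the next eigenvalue is largest. The function name `eigengap_k` and the example spectrum are hypothetical and are not taken from the paper.

```python
def eigengap_k(eigenvalues):
    """Suggest a group count via the standard eigengap heuristic.

    Assumption: `eigenvalues` are eigenvalues of a graph Laplacian,
    given in any order. This is a generic sketch, not the paper's code.
    """
    vals = sorted(eigenvalues)
    if len(vals) < 2:
        raise ValueError("need at least two eigenvalues")
    # gaps[i] is the jump from the (i+1)-th to the (i+2)-th smallest value.
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    # Index of the largest jump; k counts the eigenvalues before that jump.
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1


# Hypothetical spectrum: three near-zero eigenvalues, then a clear jump.
spectrum = [0.0, 0.02, 0.05, 0.81, 0.9, 1.1]
print(eigengap_k(spectrum))  # prints 3
```

In this made-up spectrum the largest jump follows the third eigenvalue, so the heuristic suggests three groups. How the paper maps such groups onto features to unlearn is not stated in this summary.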

Abstract

Machine unlearning enables pre-trained models to remove the effect of certain portions of training data. Previous machine unlearning schemes have mainly focused on unlearning a cluster of instances or all instances belonging to a specific class. These types of unlearning might have a significant impact on model utility, and they may be inadequate for situations where we only need to unlearn features within instances rather than whole instances. Because of this difference in granularity, current unlearning methods can hardly achieve feature-level unlearning. To address the challenges of utility and granularity, we propose a finer-grained unlearning scheme referred to as "feature unlearning". We explore two distinct scenarios based on whether annotation information about the features is given: feature unlearning with known annotations and feature unlearning without annotations. For unlearning with known annotations, we propose an adversarial learning approach to automatically remove the effects of features. For unlearning without annotations, we first use model interpretability techniques to enable the output of one of the model's layers to identify different pattern features. We then filter features from instances based on these outputs, so that the features' impact can be removed using the filtered instances and a fine-tuning process. The effectiveness of our proposed approach is demonstrated through experiments involving diverse models on various datasets in different scenarios.

Paper Structure (19 sections, 8 equations, 11 figures, 3 tables, 3 algorithms)


Figures (11)

  • Figure 1: Feature unlearning helps to tackle the fairness problem. In each sub-figure, elements of different shapes represent different features, and different colors represent different feature annotations. Figure 1a shows that the inconsistent frequency of features (triangles) may lead to unfairness. For example, instances possessing the yellow triangle feature will have a greater likelihood of being classified as class $2$, while instances possessing the blue triangle would be classified as class $1$. Fairness solutions typically aim to mitigate bias by equalizing the frequency of biased features across different classes, with the goal of transforming these features into normal features. Since these features do not play vital roles in classification, we can eliminate the bias by directly unlearning those features and retaining the remaining features for model classification (as shown in Figure 1b).
  • Figure 2: A schematic view of feature unlearning with known annotations.
  • Figure 3: A schematic view of feature unlearning without annotations.
  • Figure 4: Model structure for feature identification without annotations.
  • Figure 5: Qualitative results based on guided backpropagation (Springenberg et al., 2014). In all sub-figures, the gradient map lacks information about the unlearned features. Figure 5c shows the results without annotations, which illustrate that our interpretability-based unlearning scheme achieves almost the same results as the known-annotations scheme (Figure 5a). All results show that both types of unlearning schemes can remove the gradient information about the unlearned features during the unlearning process.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Definition 1: Machine Unlearning (Cao and Yang, 2015)
  • Definition 2: Feature Unlearning