An Information Theoretic Evaluation Metric For Strong Unlearning

Dongjae Jeon, Wonje Jeung, Taeheon Kim, Albert No, Jonghyun Choi

TL;DR

This paper addresses the inadequacy of output-only metrics for evaluating strong unlearning in deep networks. It introduces the Information Difference Index (IDI), a white-box, information-theoretic metric that quantifies residual information about forgotten data in intermediate encoder representations via mutual information estimated with the InfoNCE objective. To close the gap revealed by IDI, it proposes COLA (COLlapse and Align), a two-stage method that collapses forget-set representations in the encoder and then realigns the model to remove residual information while preserving performance. Across CIFAR-10/100 and ImageNet-1K with ResNet and ViT architectures, COLA achieves near-zero IDI and competitive accuracy, while black-box metrics often fail to detect residual information; the work argues for adopting IDI (and COLA) as part of a robust, multi-metric evaluation framework for strong unlearning in real-world deployments.

Abstract

Machine unlearning (MU) aims to remove the influence of specific data from trained models, addressing privacy concerns and ensuring compliance with regulations such as the "right to be forgotten." Evaluating strong unlearning, where the unlearned model is indistinguishable from one retrained without the forgetting data, remains a significant challenge in deep neural networks (DNNs). Common black-box metrics, such as variants of membership inference attacks and accuracy comparisons, primarily assess model outputs but often fail to capture residual information in intermediate layers. To bridge this gap, we introduce the Information Difference Index (IDI), a novel white-box metric inspired by information theory. IDI quantifies retained information in intermediate features by measuring mutual information between those features and the labels to be forgotten, offering a more comprehensive assessment of unlearning efficacy. Our experiments demonstrate that IDI effectively measures the degree of unlearning across various datasets and architectures, providing a reliable tool for evaluating strong unlearning in DNNs.
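
As a rough illustration of how mutual information between intermediate features and forget labels might be estimated with an InfoNCE-style objective (mentioned in the TL;DR), the sketch below computes the standard InfoNCE lower bound from a matrix of critic scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `infonce_lower_bound` and the convention that `scores[i][j]` is a critic score between the features of sample `i` and the label of sample `j` are hypothetical, and the paper's exact estimator may differ.

```python
import math
from typing import Sequence


def infonce_lower_bound(scores: Sequence[Sequence[float]]) -> float:
    """InfoNCE lower bound on mutual information (in nats).

    Assumption: scores[i][j] is a critic score for the pair
    (features of sample i, label of sample j); the diagonal holds
    the matched (positive) pairs. The bound can never exceed log(K),
    where K is the batch size.
    """
    k = len(scores)
    if k == 0 or any(len(row) != k for row in scores):
        raise ValueError("scores must be a non-empty square matrix")

    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all candidates in the row.
        row_max = max(row)
        log_sum = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        # log( exp(s_ii) / ((1/K) * sum_j exp(s_ij)) )
        total += row[i] - (log_sum - math.log(k))
    return total / k
```

When the critic cannot tell matched from mismatched pairs (all scores equal), the bound is zero; when it separates them sharply, the bound approaches its ceiling of log(K).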

Paper Structure

This paper contains 65 sections, 11 equations, 21 figures, 18 tables, and 2 algorithms.

Figures (21)

  • Figure 1: Performance of six methods on (CIFAR-10, ResNet-18), evaluated in efficiency (RTE), accuracy (TA), and efficacy (MIA, JSD). For TA, MIA, and JSD, lower differences from Retrain are preferred, indicating closer similarity to Retrain.
  • Figure 2: t-SNE visualizations of encoder outputs for Original, Retrain, and unlearned models from three MU methods (GA, RL, SALUN) on single-class forgetting with (CIFAR-10, ResNet-18). In each t-SNE plot, features of the forgetting class are shown in purple. Original and HD have identical feature distributions because they share the same encoder.
  • Figure 3: Forget test accuracy and IDI (our metric) for Original, Retrain, and MU methods (including COLA, our method) after head retraining with fixed unlearned encoders using 2% of $\mathcal{D}$ in (CIFAR-10, ResNet-18). IDI aligns with the recovered accuracy.
  • Figure 4: (a) Conceptual illustration of IDI. Curves show estimated mutual information $I(\mathbf{Z}_\ell; Y)$ for Original (●), unlearned (▲), and Retrain (★). IDI is the ratio $\frac{\textcolor{red}{ID(\mathbf{\theta_u})}}{\textcolor{blue}{ID(\mathbf{\theta_o})}}$, corresponding to the red area divided by the blue area. (b) MI curves and IDI values for Original, Retrain, and unlearned models (FT, RL, GA, $\ell_1$-sparse, SCRUB, SALUN) on CIFAR-10 across ResNet-18 (left) and ResNet-50 (right) blocks, averaged over five trials. See Appendix C.2 for standard deviations.
  • Figure 5: Illustration of estimating MI using InfoNCE. $f_{\nu_\ell}$ represents a trainable network to capture features from $\mathbf{Z}_\ell$, while $g_{\eta_\ell}$ handles the binary input $Y$.
  • ...and 16 more figures
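
The ratio described in the Figure 4 caption can be sketched in a few lines. This is a minimal sketch, assuming that $ID(\cdot)$ is the per-layer mutual information of a model minus that of Retrain, summed over the measured layers (the "area" between the two curves); the function names and the list-of-floats input format are hypothetical and not taken from the paper.

```python
from typing import Sequence


def information_difference(
    mi_model: Sequence[float], mi_retrain: Sequence[float]
) -> float:
    """Sum over layers of (model MI - Retrain MI).

    Assumption: each sequence holds one MI estimate per measured
    layer, in the same layer order for both models.
    """
    if len(mi_model) != len(mi_retrain):
        raise ValueError("MI curves must cover the same layers")
    return sum(m - r for m, r in zip(mi_model, mi_retrain))


def idi(
    mi_unlearned: Sequence[float],
    mi_original: Sequence[float],
    mi_retrain: Sequence[float],
) -> float:
    """IDI = ID(unlearned) / ID(original), both measured against Retrain."""
    denominator = information_difference(mi_original, mi_retrain)
    if denominator == 0:
        raise ValueError("Original and Retrain curves coincide; IDI is undefined")
    return information_difference(mi_unlearned, mi_retrain) / denominator
```

Under this assumption, an unlearned model whose MI curve coincides with Retrain yields an IDI of 0, and one whose curve coincides with Original yields an IDI of 1.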