FAST: Feature Aware Similarity Thresholding for Weak Unlearning in Black-Box Generative Models

Subhodip Panda, Prathosh AP

TL;DR

The paper tackles the problem of controlling outputs from black-box generative models by filtering undesired content, proposing FAST (Feature Aware Similarity Thresholding) that encodes undesired feature representations in latent space and applies similarity-based thresholds to suppress offending samples. It establishes a theoretical link between filtering and weak unlearning in black-box settings and demonstrates strong empirical performance under a few-shot regime, including scenarios with only negative feedback. The approach is validated on MNIST with DC-GAN and CelebA-HQ with StyleGAN2, showing improvements in recall and AUC over baselines while maintaining distributional quality metrics like FID, density, and coverage. The work highlights the practical significance of filtering as a privacy/compliance mechanism for black-box generative services and outlines future work on disentangled latent spaces and diffusion-model extensions to broaden applicability.

Abstract

The heightened emphasis on the regulation of deep generative models, propelled by escalating concerns pertaining to privacy and compliance with regulatory frameworks, underscores the imperative need for precise control mechanisms over these models. This urgency is particularly underscored by instances in which generative models generate outputs that encompass objectionable, offensive, or potentially injurious content. In response, machine unlearning has emerged to selectively forget specific knowledge or remove the influence of undesirable data subsets from pre-trained models. However, modern machine unlearning approaches typically assume access to model parameters and architectural details during unlearning, which is not always feasible. In a multitude of downstream tasks, these models function as black-box systems, with inaccessible pre-trained parameters, architectures, and training data. In such scenarios, filtering undesired outputs becomes a practical alternative. The primary goal of this study is twofold: first, to elucidate the relationship between filtering and unlearning processes, and second, to formulate a methodology aimed at mitigating the display of undesirable outputs generated from models characterized as black-box systems. Theoretical analysis in this study demonstrates that, in the context of black-box models, filtering can be seen as a form of weak unlearning. Our proposed Feature Aware Similarity Thresholding (FAST) method effectively suppresses undesired outputs by systematically encoding the representation of unwanted features in the latent space.

Paper Structure (28 sections, 2 theorems, 24 equations, 3 figures, 4 tables, 1 algorithm)

Key Result

Theorem 1

If there exists a retrained generative model with parameters ${\theta^r}$ such that $d_{ln-TV}( \ln P_{\mathcal{X}_r}|| \ln P_{\theta^r}) \leq \epsilon_1$ and a generative model with posthoc blocking layer denoting parameters ${\theta^b}$ such that $d_{ln-TV}(\ln P_{\mathcal{X}_r}|| \ln P_{\theta^b}

Figures (3)

  • Figure 1: Filtering as Weak Unlearning in Black-Box Generative Models: The left-most block represents the black-box generator with a posthoc blocking layer, with parameters $\theta^b = (\theta^{init}, t)$ and output distribution $P_{\theta^b}$; the middle block represents a retrained generative model with output distribution $P_{\theta^r}$; and the right-most block represents an $(\epsilon,0)$-unlearned model with output distribution $P_{\theta^u}$.
  • Figure 2: FAST filtering mechanism: In stage-1, the undesired latent feature is identified using only a few positive and negative samples marked by the user. The positive and negative samples are projected into the latent space using the Latent Projection Function ($\pi(\cdot)$), and subsequently, the undesired feature is retrieved via the Undesired Representation Function ($g(\cdot)$). In stage-2, during the inference phase, new test samples are projected into the latent space, and the similarity of their projection with the undesired feature obtained from stage-1 is measured to filter out the negative samples.
  • Figure 3: Results of positive mining given only 20 negative samples by the user for MNIST dataset. Samples are generated by the pre-trained GAN after obtaining the positive samples through positive mining.
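
The two-stage mechanism described in the Figure 2 caption, combined with the posthoc blocking layer of Figure 1, can be illustrated with a minimal sketch. This is not the authors' implementation. The summary does not specify the form of $\pi(\cdot)$ or $g(\cdot)$, so the sketch assumes that $g$ is the difference between the mean negative and mean positive latent codes, that similarity is cosine similarity, and that the projection is a toy identity map on 2-D points. All function names (`undesired_feature`, `is_undesired`, `blocked_generate`, `identity_projection`) are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity; defined as 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def undesired_feature(project, positives, negatives):
    """Stage 1. ASSUMPTION: g is the mean negative latent code minus the
    mean positive latent code; the paper's g may differ."""
    pos_mean = mean_vector([project(x) for x in positives])
    neg_mean = mean_vector([project(x) for x in negatives])
    return [n - p for n, p in zip(neg_mean, pos_mean)]


def is_undesired(project, sample, feature, threshold):
    """Stage 2. Flag a sample whose latent code is more similar to the
    undesired feature than the threshold t allows."""
    return cosine_similarity(project(sample), feature) > threshold


def blocked_generate(generate, project, feature, threshold, max_tries=100):
    """Posthoc blocking layer: resample from the black-box generator
    until a sample passes the filter (simple rejection sampling)."""
    for _ in range(max_tries):
        sample = generate()
        if not is_undesired(project, sample, feature, threshold):
            return sample
    raise RuntimeError("no acceptable sample within max_tries")


def identity_projection(x):
    """ASSUMPTION: toy stand-in for the latent projection pi."""
    return list(x)


# Few-shot user feedback on toy 2-D "samples".
positives = [[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]
negatives = [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]
feature = undesired_feature(identity_projection, positives, negatives)
```

Only calls to `generate` are needed, so the generator's parameters stay untouched; this matches the black-box setting in which the filter's behaviour is governed by the threshold $t$ alone. In practice the projection would be a learned encoder or inversion network rather than the identity map used here.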

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Proof
  • Corollary 1.1