Boosting Alignment for Post-Unlearning Text-to-Image Generative Models

Myeongseob Ko, Henry Li, Zhun Wang, Jonathan Patsenker, Jiachen T. Wang, Qinbin Li, Ming Jin, Dawn Song, Ruoxi Jia

TL;DR

This paper tackles the challenge of unlearning targeted content in text-to-image diffusion models without sacrificing alignment to retained concepts. It introduces a principled restricted-gradient update that monotonically improves both forgetting and remaining-data losses, while also proposing a dataset-diversification strategy for $D_r$ to avoid overfitting. The approach outperforms baselines on both class-level forgetting in CIFAR-10 diffusion models and concept-level removals (nudity, art style) in Stable Diffusion, achieving superior forgetting with close-to-original alignment. The work advances practical unlearning by offering a mathematically grounded update rule and data-collection tactics that enhance safety and copyright compliance for diffusion-based generation systems.

Abstract

Large-scale generative models have shown impressive image-generation capabilities, propelled by massive data. However, this often inadvertently leads to the generation of harmful or inappropriate content and raises copyright concerns. Driven by these concerns, machine unlearning has become crucial to effectively purge undesirable knowledge from models. While existing literature has studied various unlearning techniques, these often suffer from either poor unlearning quality or degradation in text-image alignment after unlearning, due to the competitive nature of these objectives. To address these challenges, we propose a framework that seeks an optimal model update at each unlearning iteration, ensuring monotonic improvement on both objectives. We further derive the characterization of such an update. In addition, we design procedures to strategically diversify the unlearning and remaining datasets to boost performance improvement. Our evaluation demonstrates that our method effectively removes target classes from recent diffusion-based generative models and concepts from stable diffusion models while maintaining close alignment with the models' original trained states, thus outperforming state-of-the-art baselines. Our code will be made available at https://github.com/reds-lab/Restricted_gradient_diversity_unlearning.git.

Paper Structure

This paper contains 35 sections, 3 theorems, 21 equations, 11 figures, 8 tables.

Key Result

Theorem 2

Let $\mathbf{f}$ be a function of $\mathbf{x}$. Then the maximum value of the directional derivative of $\mathbf{f}$ at $\mathbf{x}$ is $\|\nabla \mathbf{f}(\mathbf{x})\|$, the $\ell^2$ norm of its gradient. Moreover, the maximizing direction $\mathbf{v}$ is that of the gradient itself.
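
A short derivation sketch of why this holds, assuming $\mathbf{f}$ is differentiable at $\mathbf{x}$ and that the direction $\mathbf{v}$ ranges over unit vectors (neither assumption is stated in the excerpt above). Writing $D_{\mathbf{v}} \mathbf{f}(\mathbf{x})$ for the directional derivative (notation assumed here), the Cauchy-Schwarz inequality gives

$$
D_{\mathbf{v}} \mathbf{f}(\mathbf{x}) = \nabla \mathbf{f}(\mathbf{x})^\top \mathbf{v} \le \|\nabla \mathbf{f}(\mathbf{x})\| \, \|\mathbf{v}\| = \|\nabla \mathbf{f}(\mathbf{x})\|, \qquad \|\mathbf{v}\| = 1,
$$

with equality exactly when $\mathbf{v} = \nabla \mathbf{f}(\mathbf{x}) / \|\nabla \mathbf{f}(\mathbf{x})\|$, provided the gradient is nonzero.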

Figures (11)

  • Figure 1: Images generated by SalUn [fan2023salun], ESD [gandikota2023erasing], and our method after unlearning, given the condition. Each row shows a different unlearning task: nudity removal and Van Gogh style removal. Images from our approach and from SD [rombach2022high] are well aligned with the prompt, whereas SalUn and ESD fail to generate semantically correct images for the condition. Averaged over 100 different prompts, SalUn has the lowest CLIP alignment scores (0.305 for nudity removal and 0.280 for Van Gogh style removal), followed by ESD (0.329 and 0.330, respectively). Our approach achieves scores of 0.350 and 0.352 on these tasks, closely matching the original SD scores of 0.352 and 0.348.
  • Figure 2: Visualization of the update. We show the update direction (gray) obtained by (a) directly summing up the two gradients and (b) our restricted gradient.
  • Figure 3: Generated images using SD, SalUn, ESD-u, and RGD (Ours). Each row indicates generated images with different prompts including nudity-related I2P prompts and samples from $D_f$. Each column shows the images generated by different unlearning methods.
  • Figure 4: Nudity detection results from NudeNet, following prior work [fan2023salun, gandikota2023erasing]. The Y-axis shows the exposed body part detected in the generated images, given the prompt, and the X-axis shows the number of images generated by each unlearning method and SD. Bars whose value is zero are omitted from the plot.
  • Figure 5: Art style removal. Each row represents different prompts used to evaluate the alignment and each column indicates generated images from different unlearning methods.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Definition 1: Directional Derivative
  • Theorem 2: Directional derivative maximizer is the gradient
  • Definition 3: Restricted gradient, local form for minimization
  • Theorem 4: Characterizing the restricted gradient under linear approximation
  • Remark 1
  • Lemma 5: Projected gradients obtain optimal solution to a constrained objective
  • Proof of Lemma 5
  • Proof of Theorem 4
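
The restricted gradient named in Definition 3, Theorem 4, and Lemma 5 can be illustrated with a small numerical sketch. The code below assumes, based only on the title of Lemma 5 ("projected gradients"), that the update under linear approximation projects each loss gradient onto the orthogonal complement of the other and sums the two projections. That form is an assumption made for illustration, not the authors' released implementation. The function names `project_out` and `restricted_update` are hypothetical, and the two-dimensional vectors stand in for full model-parameter gradients of two losses that are both being minimized.

```python
import numpy as np


def project_out(g, h, eps=1e-12):
    """Return g with its component along h removed (hypothetical helper)."""
    denom = float(np.dot(h, h))
    if denom < eps:
        # h is (numerically) zero, so there is nothing to project out.
        return g.copy()
    return g - (np.dot(g, h) / denom) * h


def restricted_update(grad_forget, grad_retain):
    """Sum of the two mutually projected gradients (assumed form)."""
    d_f = project_out(grad_forget, grad_retain)  # orthogonal to grad_retain
    d_r = project_out(grad_retain, grad_forget)  # orthogonal to grad_forget
    return d_f + d_r


# Toy conflicting gradients: their inner product is negative.
g_f = np.array([1.0, 0.0])   # gradient of the forgetting loss
g_r = np.array([-2.0, 1.0])  # gradient of the remaining-data loss

naive = g_f + g_r                       # plain sum of the two gradients
restricted = restricted_update(g_f, g_r)

# A descent step moves along the negative update, so a non-negative inner
# product with each gradient means neither loss increases to first order.
print("naive      :", np.dot(naive, g_f), np.dot(naive, g_r))
print("restricted :", np.dot(restricted, g_f), np.dot(restricted, g_r))
```

In this toy case the plain sum has a negative inner product with `g_f`, so a step along it would increase the forgetting loss to first order, while the projected sum has a non-negative inner product with both gradients. This mirrors the contrast drawn in Figure 2 between summing the gradients directly and using the restricted gradient.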