Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu

TL;DR

This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in the robust unlearning framework referred to as AdvUnlearn, which achieves a balanced tradeoff with model utility.

Abstract

Diffusion models (DMs) have achieved remarkable success in text-to-image generation, but they also pose safety risks, such as the potential generation of harmful content and copyright violations. Machine unlearning techniques, also known as concept erasing, have been developed to address these risks. However, these techniques remain vulnerable to adversarial prompt attacks, which can cause unlearned DMs to regenerate undesired images containing the concepts (such as nudity) that were meant to be erased. This work aims to enhance the robustness of concept erasing by integrating the principle of adversarial training (AT) into machine unlearning, resulting in the robust unlearning framework referred to as AdvUnlearn. However, achieving this effectively and efficiently is highly nontrivial. First, we find that a straightforward implementation of AT compromises DMs' image generation quality post-unlearning. To address this, we develop a utility-retaining regularization on an additional retain set, optimizing the trade-off between concept erasure robustness and model utility in AdvUnlearn. Moreover, we identify the text encoder as a more suitable module for robustification than the UNet, ensuring unlearning effectiveness. The resulting text encoder can also serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform extensive experiments to demonstrate the robustness advantage of AdvUnlearn across various DM unlearning scenarios, including the erasure of nudity, objects, and style concepts. In addition to robustness, AdvUnlearn also achieves a balanced tradeoff with model utility. To our knowledge, this is the first work to systematically explore robust DM unlearning through AT, setting it apart from existing methods that overlook robustness in concept erasing. Code is available at: https://github.com/OPTML-Group/AdvUnlearn
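
As a reading aid, the idea described in the abstract (adversarial training applied to unlearning, plus a utility-retaining regularizer on a retain set) can be sketched as a generic min-max objective. This is only an illustrative form under assumed notation, not the paper's exact formulation: the symbols $\theta$ (trainable parameters, e.g. of the text encoder), $c_e$ (the prompt for the concept to erase), $\delta \in \Delta$ (an allowed adversarial prompt perturbation), $\ell_u$ (an unlearning loss), $\ell_r$ (a retain-set loss), $\mathcal{C}_r$ (the retain set), and $\gamma$ (a regularization weight) are all introduced here for illustration.

```latex
% Hypothetical sketch: adversarially robust unlearning with a retain-set regularizer.
% Inner max: find the worst-case prompt perturbation that revives the erased concept.
% Outer min: update the model to unlearn under that perturbation while
%            preserving generation quality on retain prompts.
\min_{\theta} \;
  \Big[ \max_{\delta \in \Delta} \, \ell_u\big(\theta;\, c_e + \delta\big) \Big]
  \; + \; \gamma \, \mathbb{E}_{c_r \sim \mathcal{C}_r} \big[ \ell_r(\theta;\, c_r) \big]
```

In this reading, setting $\gamma = 0$ corresponds to the "straightforward implementation of AT" that the abstract reports as harming image quality, while a larger $\gamma$ trades some robustness for utility.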

Paper Structure

This paper contains 19 sections, 7 equations, 11 figures, 14 tables, 1 algorithm.

Figures (11)

  • Figure 1: Visualizations of SD v1.4 [rombach2022high] and its concept-erased version, ESD [gandikota2023erasing], conditioned on the prompt 'Hot sexy girl'. ESD's generations are shown both without attack and under UnlearnDiffAtk [zhang2023generate].
  • Figure 2: Performance overview of our proposal, AdvUnlearn, and various DM unlearning baselines when unlearning the nudity concept under the SD v1.4 model. Robustness is measured by the attack success rate (ASR) against UnlearnDiffAtk [zhang2023generate]. Image generation retention is assessed through the Fréchet Inception Distance (FID). A lower ASR implies better robustness, and a lower FID implies better utility. The baselines include the vanilla SD v1.4 and its unlearned versions using ESD [gandikota2023erasing], FMN [zhang2023forget], UCE [gandikota2023unified], SalUn [fan2023salun], and SPM [lyu2023one].
  • Figure 3: Generation examples using the DMs in Tab. \ref{tab: motivation} for nudity unlearning, conditioned on benign and harmful prompts.
  • Figure 4: Visualizations of images generated by SD v1.4 and its nudity-unlearned variants. First row: Image generation under the adversarial prompt attack UnlearnDiffAtk [zhang2023generate] applied to the original prompt 'woman standing in the doorway'. Second row: Image generation under the standard benign prompt 'clean white toilet'.
  • Figure 5: Examples of images generated by DMs when unlearning the Van Gogh style, following the format of Fig. \ref{fig: visualization_for_nudity}, with the attack shown in the first row.
  • ...and 6 more figures