
STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer, Vishal M. Patel, Karthik Nandakumar

TL;DR

STEREO introduces a two-stage framework for adversarially robust concept erasure in text-to-image diffusion models. The first stage uses adversarial training as a vulnerability identification step to discover strong adversarial prompts in the embedding space. The second stage applies an anchor-concept-based compositional objective to erase the target concept in a single fine-tuning stage while preserving generation quality for benign concepts. Across nudity, art style, and object removal, STEREO is more robust to white-box and black-box attacks than existing robust concept erasure methods, with only marginal utility loss. The approach reduces the risk of concept regeneration via embedding-space attacks and offers practical, faster deployment relative to prior methods, with potential applicability to multiple concepts in future work.

Abstract

The rapid proliferation of large-scale text-to-image diffusion (T2ID) models has raised serious concerns about their potential misuse in generating harmful content. Although numerous methods have been proposed for erasing undesired concepts from T2ID models, they often provide a false sense of security; concept-erased models (CEMs) can still be manipulated via adversarial attacks to regenerate the erased concept. While a few robust concept erasure methods based on adversarial training have emerged recently, they compromise on utility (generation quality for benign concepts) to achieve robustness and/or remain vulnerable to advanced embedding space attacks. These limitations stem from the failure of robust CEMs to thoroughly search for "blind spots" in the embedding space. To bridge this gap, we propose STEREO, a novel two-stage framework that employs adversarial training as a first step rather than the only step for robust concept erasure. In the first stage, "search thoroughly enough", STEREO employs adversarial training as a vulnerability identification mechanism. In the second stage, "robustly erase once", STEREO introduces an anchor-concept-based compositional objective to robustly erase the target concept in a single fine-tuning stage, while minimizing the degradation of model utility. We benchmark STEREO against seven state-of-the-art concept erasure methods, demonstrating its superior robustness to both white-box and black-box attacks, while largely preserving utility.
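
The two-stage control flow described in the abstract can be illustrated with a deliberately simplified toy. This is a sketch under strong assumptions, not the authors' implementation: the "model" here is just the set of prompts that still regenerate the target concept, `erase` and `invert` are hypothetical stand-ins for fine-tuning and for a concept inversion attack, and the anchor-concept compositional objective is not modeled because its form is not given in this excerpt.

```python
from typing import FrozenSet, Iterable, List, Optional

# Toy stand-in (assumption): a "model" is the set of prompts that still
# regenerate the target concept. Real concept erasure fine-tunes model weights.
ToyModel = FrozenSet[str]


def erase(model: ToyModel, prompts: Iterable[str]) -> ToyModel:
    """Hypothetical erasure step: the given prompts stop triggering the concept."""
    return frozenset(model) - set(prompts)


def invert(model: ToyModel) -> Optional[str]:
    """Hypothetical attack: return some prompt that still regenerates the concept."""
    return min(model) if model else None


def search_thoroughly_enough(original: ToyModel, target: str, rounds: int) -> List[str]:
    """Stage 1: alternate erasing and attacking, collecting adversarial prompts."""
    model = erase(original, [target])
    adversarial: List[str] = []
    for _ in range(rounds):
        found = invert(model)
        if found is None:  # the attack no longer succeeds
            break
        adversarial.append(found)
        model = erase(model, [found])  # patch the newly found blind spot
    return adversarial


def robustly_erase_once(original: ToyModel, target: str, adversarial: List[str]) -> ToyModel:
    """Stage 2: restart from the ORIGINAL model and erase everything in one pass."""
    return erase(original, [target, *adversarial])


original: ToyModel = frozenset({"church", "adv_a", "adv_b", "adv_c"})

# A shallow search leaves blind spots that an attack can still exploit.
shallow = robustly_erase_once(original, "church",
                              search_thoroughly_enough(original, "church", rounds=1))
# A thorough search collects every adversarial prompt before the single erase.
thorough = robustly_erase_once(original, "church",
                               search_thoroughly_enough(original, "church", rounds=5))

print(invert(shallow))   # adv_b
print(invert(thorough))  # None
```

In this toy, the intermediate models from Stage 1 are discarded and only the collected prompts carry over, mirroring the abstract's point that adversarial training serves as a first step rather than the only step.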

Paper Structure (18 sections, 5 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: Vulnerability of "robust" concept erasure methods to concept inversion attacks. Even state-of-the-art concept erasure methods that claim to be "adversarially robust", such as RECE [gong2024reliable], RACE [kim2024race], and AdvUnlearn [zhang2024defensive], are still vulnerable to concept inversion attacks [pham2023circumventing], which regenerate the erased concept by operating in the embedding space. The figure shows examples of this vulnerability across diverse categories such as Artistic-Style, Nudity, and Object. Our STEREO method achieves superior robustness through its two-stage framework: thorough vulnerability identification via adversarial training followed by anchor-concept guided erasure.
  • Figure 2: Overview of STEREO. Our novel two-stage approach robustly erases target concepts from pre-trained text-to-image diffusion models while preserving high utility for benign concepts. Stage 1 (top): Search Thoroughly Enough fine-tunes the model through iterative concept erasing and concept inversion attacks, collecting a strong set of adversarial prompts. Stage 2 (bottom): Robustly Erase Once fine-tunes the original model using anchor concepts and the set of strong adversarial prompts from Stage 1 via a compositional objective, maintaining high-fidelity generation of benign concepts while robustly erasing the target concept.
  • Figure 3: Erasing only concept synonyms is effective but remains vulnerable to attacks, as the "Church" concept is regenerated under the CCE [pham2023circumventing] attack. The proposed STEREO approach identifies strong adversarial prompts $P^{*}$, facilitating robust concept erasure and making the model resistant to inversion attacks.
  • Figure 4: Performance of robust concept erasure methods for "nudity", including RECE, RACE, and AdvUnlearn, under black-box (RAB) and white-box (UD, CCE) attacks. While all methods are vulnerable to concept regeneration when attacked by the powerful CCE attack, our proposed STEREO demonstrates resilience, effectively preventing the regeneration of erased concepts.
  • Figure 5: (Top row) Performance of concept erasure methods under the CCE attack for Van Gogh art style erasing. (Bottom row) Utility preservation on a benign art style ("Girl with a Pearl Earring by Jan Vermeer"). In both cases, STEREO outperforms other methods, demonstrating superior robustness against adversarial attacks and better utility preservation.
  • ...and 8 more figures