Pruning for Robust Concept Erasing in Diffusion Models

Tianyun Yang; Juan Cao; Chang Xu

TL;DR

The paper tackles the robustness gap in concept erasing for diffusion-based image generation, where adversarial prompts can resurrect erased concepts. It introduces a differentiable pruning framework that learns a parameter-level mask to disable critical pathways linked to erased concepts, and integrates this with existing erasing objectives such as ESD and AC. Across Nudity, Style, and Object erasure tasks, the proposed P-ESD and P-AC methods significantly improve robustness to adversarial prompts (approaching a 40% gain in NSFW erasure and ~30% in style erasure) while maintaining generation quality (FID comparable to baselines). The approach reduces sensitivity of concept-related pathways, offers recoverable erasing with lightweight storage, and provides a practical path toward safer deployment of diffusion models in open-world settings.

Abstract

Despite their impressive image-generation capabilities, text-to-image diffusion models are susceptible to producing undesirable outputs such as NSFW content and copyrighted artworks. To address this issue, recent studies have focused on fine-tuning model parameters to erase problematic concepts. However, existing methods exhibit a major flaw in robustness: fine-tuned models often reproduce the undesirable outputs when faced with cleverly crafted prompts. This reveals a fundamental limitation of current approaches and may raise risks for deploying diffusion models in the open world. To address this gap, we locate the concept-correlated neurons and find that they are highly sensitive to adversarial prompts: they can be deactivated during erasing yet reactivated under attack. To improve robustness, we introduce a new pruning-based strategy for concept erasing. Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. It can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs. Experimental results show a significant enhancement in our model's ability to resist adversarial inputs, achieving nearly a 40% improvement in erasing NSFW content and a 30% improvement in erasing artwork style.
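The idea of learning a parameter-level mask that prunes concept-related parameters can be illustrated with a toy sketch. This is not the paper's implementation: the sigmoid relaxation of the mask, the two quadratic loss terms, the 0.5 threshold, and all names (`learn_mask`, `x_concept`, `x_retain`) are assumptions chosen for illustration, and a four-weight elementwise model stands in for a diffusion network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def learn_mask(w, x_concept, x_retain, steps=200, lr=0.5, init_logit=2.0):
    """Learn a soft mask m = sigmoid(logit) per weight, then binarize it.

    Toy objective (an assumption, not the paper's loss):
      erase term:    sum_i (w_i * m_i * x_concept_i)^2        -> push response to the concept to 0
      preserve term: sum_i (w_i * (m_i - 1) * x_retain_i)^2   -> keep response to other inputs unchanged
    """
    logits = [init_logit] * len(w)  # start with every weight (almost) kept
    for _ in range(steps):
        for i, wi in enumerate(w):
            m = sigmoid(logits[i])
            # Gradient of the toy objective with respect to the soft mask value m_i.
            grad_m = 2.0 * wi ** 2 * (x_concept[i] ** 2 * m + x_retain[i] ** 2 * (m - 1.0))
            # Chain rule through the sigmoid: dm/dlogit = m * (1 - m).
            logits[i] -= lr * grad_m * m * (1.0 - m)
    # Hard (binary) mask: 0 means the weight is pruned, 1 means it is kept.
    return [1 if sigmoid(s) > 0.5 else 0 for s in logits]

# Hypothetical toy data: weights 0 and 1 respond mostly to the concept input,
# weights 2 and 3 respond mostly to the input whose behavior should be retained.
w = [1.0, -2.0, 0.5, 1.5]
x_concept = [1.0, 0.8, 0.1, 0.0]
x_retain = [0.0, 0.1, 1.0, 0.9]

hard_mask = learn_mask(w, x_concept, x_retain)
pruned_w = [wi * mi for wi, mi in zip(w, hard_mask)]
print(hard_mask, pruned_w)
```

In this sketch the original weights stay untouched and only a binary mask is produced, which is one way the recoverable, lightweight-storage property mentioned in the TL;DR could be realized: dropping the mask restores the original model.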

Paper Structure (27 sections, 7 equations, 11 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Concept erasing in diffusion models
Neural network pruning
Robust Concept Erasing
Preliminary
Vulnerability of Concept Erasing
Pruning for Robust Concept Erasing
Experiments
Experimental Setups
Erasing Nudity
Erasing Style
Erasing Objects
Analysis of the Proposed Method
Conclusion
...and 12 more sections

Figures (11)

Figure 1: Left panel: semantic illustration of prior concept erasing methods (top row) and our method (bottom row). Right panel: concrete examples illustrating the vulnerability of prior concept-erasing methods and the robustness of our method.
Figure 2: Sensitivity score of concept and non-concept neurons when attacked. The results are obtained from the erased models for nudity, Van Gogh (style), and church (object).
Figure 3: Visualization of concept neurons in the original Stable Diffusion (SD) and in the SD edited by the ESD method (Gandikota et al., 2023). Redder regions indicate higher activation values. As seen, with test prompts the concept neurons are activated in the original SD (first column) and deactivated in the edited model (second column). However, with adversarial prompts, those neurons are re-activated (third column). The captions in the gray box indicate the specific locations of concept neurons in the diffusion models.
Figure 4: Visualization examples. The black boxes in the first two rows are added by the authors to hide NSFW content for publication. The symbol ✓ represents successful concept erasure, and ✗ indicates a failure in concept erasure.
Figure 5: Sensitivity score comparison of concept neurons between ESD and P-ESD.
...and 6 more figures
