Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

Vitali Petsiuk, Kate Saenko

TL;DR

The paper exposes a fundamental vulnerability in diffusion-model safety: concept inhibition can be bypassed through compositional inference, enabling reconstruction of erased concepts without direct access to inhibited weights. It introduces ARC attacks, a theory-grounded framework where target guidance g*(c) can be approximated via linear combinations of inhibited guidance for distant concepts, formalized by g(c) = λ(c)·y0 + (1−λ(c))·g*(c) with an exponentially decaying λ. The authors validate the approach with both theory (Propositions P1–P4) and extensive experiments, showing that nudity and object-inhibition defenses remain brittle under multi-prompt composition across several inhibition techniques (AC, ESD, UCE). The work highlights a need for safety mechanisms that do not rely on localized, per-concept edits and proposes a framework to test and guide the development of more robust defenses against compositional, input-space attacks in diffusion models.
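The inhibition model above can be illustrated with a toy numerical sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: concepts are small hand-picked vectors, the original guidance `g_star` is a fixed linear map, `lam` decays exponentially with Euclidean distance from the inhibited concept, and `TAU` is a hypothetical decay scale. Under those assumptions, the difference $g(c_t+c_d)-g(c_d)$ described in the Figure 2 caption lands close to $g^*(c_t)$, while querying $g(c_t)$ directly returns only $y_0$. No diffusion model is involved.

```python
import math

TAU = 0.25  # hypothetical decay scale for lambda (assumption)

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]
def scale(s, u): return [s * a for a in u]
def norm(u): return math.sqrt(sum(a * a for a in u))

# Toy stand-ins (all hypothetical): a linear "original guidance" map,
# a target concept c_t, a distant detour concept c_d, and the vector y0
# that the inhibited model returns at the target concept.
W = [[2.0, 0.0, 1.0],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 3.0]]
c_t = [1.0, 0.0, 0.0]
c_d = [0.0, 2.0, 0.0]
y0 = [0.5, 0.5, 0.5]

def g_star(c):
    """Original (uninhibited) guidance; linear by assumption."""
    return [sum(w * x for w, x in zip(row, c)) for row in W]

def lam(c):
    """Inhibition strength: 1 at c_t, decaying exponentially with distance."""
    return math.exp(-norm(sub(c, c_t)) / TAU)

def g(c):
    """Inhibited guidance: g(c) = lam(c) * y0 + (1 - lam(c)) * g_star(c)."""
    l = lam(c)
    return add(scale(l, y0), scale(1.0 - l, g_star(c)))

target = g_star(c_t)                       # what inhibition tries to hide
direct = g(c_t)                            # equals y0, since lam(c_t) = 1
recovered = sub(g(add(c_t, c_d)), g(c_d))  # detour arithmetic from Figure 2

err_direct = norm(sub(direct, target))
err_recovered = norm(sub(recovered, target))
print(f"direct error:    {err_direct:.4f}")
print(f"recovered error: {err_recovered:.6f}")
```

The recovered vector is only as accurate as the two assumptions behind it: the detour points must sit far enough from `c_t` that `lam` is negligible there, and `g_star` must be close to additive over concepts. With a larger `TAU` or a detour concept closer to the target, the approximation degrades.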

Abstract

Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation. This property allows us to combine other concepts that should not have been affected by the inhibition to reconstruct the vector responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence of why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up the discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models. Content Advisory: This paper contains discussions and model-generated content that may be considered offensive. Reader discretion is advised. Project page: https://cs-people.bu.edu/vpetsiuk/arc

Paper Structure (23 sections, 22 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: While recent methods for erasing concepts in Diffusion Models successfully pass their respective evaluations (middle row), they do not entirely remove the target concept (such as zebra) from model weights as claimed. In this work, we propose a method to reproduce the erased concept using the inhibited models (bottom row).
  • Figure 2: Even if the computation of conditional guidance for target concept $g(c_t)$ ('zebra', 'car') is modified (inhibited with AC, ESD), we can use a detour concept $c_d$ ('cake', 'text') to compute $g(c_t+c_d)-g(c_d)$. We provide theoretical and empirical evidence that this guidance can be used to generate images with the target concept $c_t$.
  • Figure 3: Detection of nudity categories using NudeNet [bedapudi_praneeth_2019_3584720_nudenet] for the images generated with original and inhibited SD models for I2P [schramowski2022SLD] prompts. Inhibition is achieved using the ESD-u [gandikota2023ESD], UCE [gandikota2023UCE], and Selective Amnesia [heng2023Selective] methods. While analysis of the Standard Inference (SI) alone shows a significant reduction in the generated nudity from original SD (gray) to inhibited SD (green), the Compositional Inference attacks (red) defined in Table \ref{tab:attack-implementations} demonstrate that the same inhibited models can still be used to generate undesired content. In some cases, performing the attacks on inhibited models even results in higher nudity generation rates than those of the original SD model (red bars larger than gray).
  • Figure 4: Target concept reproduction rates (averaged over concepts) for the original model (gray) and for models inhibited with various methods. Generation using the attacks from Table \ref{tab:attack-implementations} (red) demonstrates significantly higher reproduction rates of the "erased" concept compared to standard inference (green).
  • Figure 5: Attacked generation using the model with inhibited concept 'zebra' (AC-100). The reproduction rates (\ref{fig:zebra-hista}) show very few images at any percentile for the standard inference, while the O3 attack shows a significant number of images with high CLIP Scores. This is confirmed by the images with the highest CLIP Scores for the attacked generation (\ref{fig:zebra-histc}) and the corresponding images using standard inference (\ref{fig:zebra-histb}).
  • ...and 3 more figures

Theorems & Definitions (4)

  • proof
  • proof
  • proof
  • proof