Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

Uichan Lee, Jeonghyeon Kim, Sangheum Hwang

TL;DR

High-Level Representation Misdirection (HiRM) erases a target concept by misdirecting its high-level semantic representations in the text encoder toward designated vectors, such as random directions or semantically defined directions, while updating only the early layers that contain causal states of visual attributes.

Abstract

Text-to-image (T2I) diffusion models have seen rapid and widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). Recent causal tracing studies, however, suggest that visual attribute information is localized in the early self-attention layers of the text encoder, pointing to an alternative site for concept erasure. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning the early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation, we propose High-Level Representation Misdirection (HiRM), which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions (e.g., supercategories), while updating only the early layers that contain causal states of visual attributes. This decoupling of update location from erasure target enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost, transfers to state-of-the-art architectures such as Flux without additional training, and shows synergistic effects with denoiser-based concept erasure methods.
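
The core idea in the abstract, steering a late representation toward a designated vector while changing only early parameters, can be illustrated with a toy example. The sketch below is not the paper's implementation: it replaces the text encoder with two small linear layers, uses a squared-error objective, and picks an arbitrary target scale of 2.0, all of which are assumptions made for illustration. The names `misdirection_step`, `W_early`, `W_late`, and `x_target` are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(W_early, W_late, x):
    """Toy stand-in for a text encoder: an early linear layer, then a late one."""
    return matvec(W_late, matvec(W_early, x))

def misdirection_step(W_early, W_late, x, target, lr=0.1):
    """One gradient step on ||encode(x) - target||^2 that changes W_early only."""
    err = [o - t for o, t in zip(encode(W_early, W_late, x), target)]
    # Backpropagate the error through the frozen late layer: g = W_late^T err.
    g = [sum(W_late[i][j] * err[i] for i in range(len(err)))
         for j in range(len(W_late[0]))]
    # Gradient w.r.t. W_early[j][k] is 2 * g[j] * x[k]; W_late is never modified.
    updated = [[W_early[j][k] - lr * 2.0 * g[j] * x[k] for k in range(len(x))]
               for j in range(len(W_early))]
    return updated, sum(e * e for e in err)

# Designated vector: a random direction with an assumed scale of 2.0.
rng = random.Random(0)
direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
norm = sum(d * d for d in direction) ** 0.5
target = [2.0 * d / norm for d in direction]

W_early = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # trainable
W_late = [[1.0, 0.2, 0.0], [0.0, 1.0, 0.2], [0.2, 0.0, 1.0]]   # frozen
x_target = [0.6, 0.8, 0.0]  # hypothetical embedding of the concept to erase

losses = []
for _ in range(100):
    W_early, loss = misdirection_step(W_early, W_late, x_target, target)
    losses.append(loss)
final_output = encode(W_early, W_late, x_target)
```

In this toy setting the loss is measured at the output of the late layer, but the gradient is applied only to the early layer, which mirrors the separation between where the model is updated and where the erasure objective is defined. How the real method selects layers, tokens, and target vectors is described in the paper itself, not in this sketch.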


Paper Structure

This paper contains 30 sections, 2 equations, 20 figures, 14 tables.

Figures (20)

  • Figure 1: Overview of HiRM. (a) HiRM updates only the first block of the text encoder while steering the final-layer representations of target concepts (e.g., "Van Gogh") toward designated directions, effectively decoupling update location and erasure target. (b) Compared to existing methods, HiRM achieves a better balance between concept erasure and utility preservation. (c) HiRM demonstrates favorable trade-offs in terms of training time and overall performance, measured as the average of erasure and retention scores across style, object, and robustness settings.
  • Figure 2: Comparison of t-SNE visualizations of HiRM-R. Each figure compares token embeddings before and after erasure, where blue circles represent the original embeddings and orange X markers represent the embeddings after erasure. The left columns of each figure visualize embeddings from the final transformer block, and the right columns show embeddings from the first block.
  • Figure 3: MLLM-as-a-judge prompt example.
  • Figure 4: HiRM-R sample response.
  • Figure 5: Stable Diffusion (SD) sample response.
  • ...and 15 more figures