Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

Leyang Li, Shilin Lu, Yan Ren, Adams Wai-Kin Kong

TL;DR

The paper addresses the removal of unwanted concepts from text-to-image diffusion outputs, targeting a core limitation of prior finetuning methods: they disrupt early-stage structure. It introduces ANT, a trajectory-aware framework built on the observation that reversing the classifier-free guidance (CFG) condition direction during mid-to-late denoising erases content while keeping samples on the natural image manifold. ANT pairs this with a four-term loss that preserves early-stage guidance while pushing samples away from undesired modes. For single-concept erasure, a heavy-hitters mechanism intersects weight saliency maps to identify a small, targeted parameter subset for finetuning. For multi-concept erasure, the objective serves as a plug-and-play improvement to frameworks such as MACE, using LoRA-based fusion with a closed-form solution. Across NSFW, celebrity, and art-style erasure tasks, ANT achieves state-of-the-art results, balancing erasure against preservation while maintaining high image quality. Directions for future exploration include newer architectures and robustness to adversarial prompts.

Abstract

Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at https://github.com/lileyang1210/ANT
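The key insight in the abstract can be illustrated at sampling time with a minimal sketch. It uses the standard CFG combination, eps_uncond + scale * (eps_cond - eps_uncond), and flips the sign of the condition direction once the timestep falls below a threshold. The function name `guided_noise`, the default scale of 7.5, and the choice to keep the original direction at exactly `t == t_prime` are assumptions made for illustration. This is not the ANT finetuning objective, only the reversal effect that motivates it.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, t, t_prime, scale=7.5):
    """Classifier-free guidance with the condition direction flipped late.

    Illustrative assumptions (not from the paper): the function name, the
    default guidance scale, and reversing strictly when t < t_prime.
    """
    delta = eps_cond - eps_uncond          # condition direction delta(c)
    sign = 1.0 if t >= t_prime else -1.0   # early: keep; mid-to-late: reverse
    return eps_uncond + sign * scale * delta

# Toy noise predictions standing in for the outputs of a denoising network.
eps_u = np.zeros(4)
eps_c = np.ones(4)

early = guided_noise(eps_u, eps_c, t=45, t_prime=35)  # guided toward the condition
late = guided_noise(eps_u, eps_c, t=10, t_prime=35)   # guided away from it
```

With timesteps running from high to low during denoising, the early steps still follow the original guidance, so the coarse structure is formed as usual, and only the later steps push away from the condition.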

Paper Structure

This paper contains 22 sections, 7 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Geometric perspective on concept erasure in diffusion models. (a) Conventional Denoising Trajectory. A high-dimensional Gaussian sample, starting on a large sphere, converges to the human data manifold via classifier-free guidance (CFG). (b) Anchor-Free Finetuned Trajectory. Finetuning often modifies the orientation of the predicted conditional score functions so that they direct away from the unwanted concept manifold. This results in a condition direction $\boldsymbol{\delta}(\boldsymbol{c}) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{z}_{t}, t, \boldsymbol{c}) - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{z}_{t}, t)$ nearly opposite to that of the original model, making the trajectory more likely to produce out-of-distribution samples. Note that, in the absence of an unconditional constraint, modifications to the conditional output also affect the unconditional output due to shared model parameters. (c) Anchor-Based Finetuned Trajectory. The model is finetuned so that the predicted score functions (or keys & values) for the unwanted concept align with those of the original model conditioned on a benign anchor, ensuring final samples lie on the anchor manifold, though not necessarily at the highest-probability mode. (d) Our Trajectory (ANT). In the early stage (when $t>t^\prime$), the conditional score functions remain directed toward the natural data mode, keeping the finetuned model aligned with the original. When $t<t^\prime$, they are finetuned to point away from the unwanted concept manifold. ANT encourages the unconditional score functions to remain unchanged throughout all stages.
  • Figure 2: Generation results of different concept erasure methods conditioned on the concept "cat". The anchor-free method (ESD) often produces images with visual artifacts or content that is out of distribution. The anchor-based method (MACE), which maps "cat" to "forest", performs reasonably well in simple contexts but results in unnatural or incoherent outputs in more complex scenarios. In contrast, our trajectory-aware method (ANT) effectively removes the target concept while preserving the overall structure and contextual integrity of the generated images.
  • Figure 3: Effect of condition direction reversal at different timesteps. Each column represents a distinct semantic condition, and each row shows generated outputs under varying reversal strategies. (a) shows the images originally generated by the diffusion process (timesteps 50→1). (b)–(d) show results when the condition direction $\boldsymbol{\delta}(\boldsymbol{c}) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{z}_{t}, t, \boldsymbol{c}) - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{z}_{t}, t)$ is reversed at different timesteps (25, 35, and 45). With a proper $t'$, specific attributes can be removed while preserving image naturalness. If $t'$ is too early, structural integrity is lost; if too late, only fine details are affected.
  • Figure 4: Each subplot shows the number of active parameters (y-axis) against the number of intersected saliency maps (x-axis) for four concepts: (a) Nudity, (b) Donald Trump, (c) Van Gogh Style, and (d) Dog. The number of active parameters converges across different concept types with around 100 intersected saliency maps.
  • Figure 5: Generation of the concept-specific saliency map $\boldsymbol{M}^*$. GPT-4 generates prompts $\mathcal{C} = \{c_i\}_{i=1}^{N_c}$, each paired with random seeds $\mathcal{S} = \{s_j\}_{j=1}^{N_s}$, which are used to compute gradient maps. After thresholding, saliency maps are obtained, and their intersection across all prompts and seeds yields $\boldsymbol{M}^*$.
  • ...and 3 more figures
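
The saliency-map intersection described in the captions of Figures 4 and 5 can be sketched as follows. Each gradient map (one per prompt and seed) is thresholded into a boolean saliency map, and the maps are intersected with a logical AND. The thresholding rule used here (absolute gradient at or above a fixed value), the function names, and the toy gradients are assumptions for illustration, since the captions do not specify them.

```python
import numpy as np

def saliency_mask(grad, threshold):
    """Boolean map of parameters whose gradient magnitude meets the threshold.

    The absolute-value rule is an assumed thresholding scheme.
    """
    return np.abs(grad) >= threshold

def intersect_saliency(grads, threshold):
    """Intersect saliency maps and track how many parameters stay active."""
    mask = np.ones_like(grads[0], dtype=bool)
    active_counts = []
    for grad in grads:
        mask &= saliency_mask(grad, threshold)   # keep only shared salient entries
        active_counts.append(int(mask.sum()))    # active parameters so far
    return mask, active_counts

# Toy gradient maps standing in for per-prompt, per-seed gradients.
grads = [
    np.array([0.9, 0.1, 0.8, 0.7]),
    np.array([0.8, 0.9, 0.05, 0.6]),
    np.array([0.7, 0.2, 0.9, 0.55]),
]
final_mask, counts = intersect_saliency(grads, threshold=0.5)
```

Because each intersection can only remove entries, the count of active parameters never increases as more maps are added, which is consistent with the convergence behaviour plotted in Figure 4.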