Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

Sanghyun Kim; Seohyeon Jung; Balhae Kim; Moonseok Choi; Jinwoo Shin; Juho Lee

TL;DR

This work tackles the safety and ethical deployment challenges of large-scale text-to-image diffusion models by introducing Human Feedback Inversion (HFI). HFI collects human judgments on model outputs, trains a reward model $r_\psi$, and inverts the feedback into a soft token $v^*$ that guides concept removal. A self-distillation-based fine-tuning objective $L_{SDD}$ then adapts the diffusion model to suppress the target concept while preserving image quality; training is restricted to the cross-attention layers and concentrated on mid-trajectory timesteps drawn from a Beta$(\alpha,\beta)$ distribution with $\alpha=\beta=3$. The framework is demonstrated on both artist-style removal and harmful-content scenarios, where HFI+SDD outperforms existing inference-time and fine-tuning baselines in reducing unsafe outputs while maintaining perceptual quality, according to CLIP-based evaluations and human judgments. By encoding the human perspective directly into soft tokens through reward-guided inversion, the approach offers a cost-effective, scalable path to safer diffusion models for public use. The work also highlights practical considerations for data collection and the importance of aligning model behavior with nuanced human judgments, laying groundwork for more robust, human-centric safety mechanisms in generative AI.
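To make the timestep schedule concrete, here is a minimal sketch of drawing training timesteps from Beta$(3,3)$. It only illustrates why this choice favors the middle of the trajectory. The function name `sample_timesteps`, the total of 1000 training steps, and the linear mapping from $[0,1]$ to integer timesteps are assumptions for illustration, not details taken from the paper.

```python
import random


def sample_timesteps(n, num_train_steps=1000, alpha=3.0, beta=3.0, seed=0):
    """Draw n integer timesteps in [0, num_train_steps - 1].

    Hypothetical helper: num_train_steps=1000 and the linear scaling
    below are assumptions, not values confirmed by the source.
    """
    rng = random.Random(seed)
    steps = []
    for _ in range(n):
        # u lies in [0, 1]; with alpha == beta the density peaks at 0.5
        u = rng.betavariate(alpha, beta)
        # Scale to an integer timestep and clamp the u == 1.0 edge case
        t = min(int(u * num_train_steps), num_train_steps - 1)
        steps.append(t)
    return steps


if __name__ == "__main__":
    ts = sample_timesteps(10_000)
    middle = sum(250 <= t < 750 for t in ts) / len(ts)
    print(f"fraction of draws in the middle half: {middle:.2f}")
```

Under Beta$(3,3)$, about 79% of the probability mass lies in the middle half of the unit interval, compared with 50% for uniform sampling, so most updates land on mid-trajectory timesteps.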

Abstract

This paper addresses the societal concerns raised by large-scale text-to-image diffusion models, which can generate potentially harmful or copyrighted content. Existing models rely heavily on internet-crawled data, in which problematic concepts persist due to incomplete filtering. While previous approaches alleviate the issue to some extent, they often rely on text-specified concepts, which makes it difficult to capture nuanced concepts accurately and to align model knowledge with human understanding. In response, we propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens that guide the mitigation or removal of problematic images. The proposed framework can be built upon existing techniques for the same purpose, enhancing their alignment with human judgment. By doing so, we simplify the training objective with a self-distillation-based technique, providing a strong baseline for concept removal. Our experimental results demonstrate that our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.

Paper Structure (42 sections, 9 equations, 24 figures, 5 tables, 1 algorithm)

Introduction
Background on Diffusion Models
Method
Collecting and Modeling Human Feedback
Inverting Feedback into Embeddings
Removing Learned Concepts with Self-Distillation
Related Work
Experiments
Baselines and Evaluation Protocols
Artist Style Removal
Harmful Concept Removal
Analysis on Rewards and Learned Embeddings
Conclusion and Discussion
Limitation.
Algorithm
...and 27 more sections

Figures (24)

Figure 1: Comparative analysis of NSFW content removal techniques, including our proposed fine-tuning method (SDD), both with and without HFI framework. The results clearly show that incorporating HFI significantly reduces the amplification of provocative body parts and ensures the generation of clothed representations.
Figure 2: Artist style removal results on Vincent van Gogh. Images are generated with prompts such as "a painting of flowers in the style of Vincent van Gogh". HFI+SDD effectively eliminates his distinct artistic features while preserving the essence of the original subject matter, subject integrity, and visual quality.
Figure 3: Comparison of images generated with famous artwork titles of the three artists. HFI+SDD, leveraging human feedback, captures and removes distinctive artistic styles more effectively while preserving the integrity and essence of the original paintings.
Figure 4: Illustration of our proposed framework.
Figure 5: Comparison of concept removal methods for Vincent van Gogh's artistic style. Van Gogh's style is marked by bold brushstrokes, vivid colors, and swirling patterns. HFI+SDD is the most effective, consistently producing the most neutral but still artwork-like images and successfully eliminating his stylistic characteristics.
...and 19 more figures