Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee
TL;DR
This work tackles the safety and ethical deployment challenges of large-scale text-to-image diffusion models by introducing Human Feedback Inversion (HFI). HFI collects human judgments on model outputs, trains a reward model $r_\psi$, and inverts the feedback into a soft token $v^*$ that guides concept removal. A self-distillation-based fine-tuning objective $L_{SDD}$ then adapts the diffusion model to suppress the target concept while preserving image quality, with training focused on cross-attention layers and on mid-trajectory timesteps drawn from a Beta$(\alpha,\beta)$ distribution with $\alpha=\beta=3$. The framework is demonstrated on both artist-style removal and harmful-content scenarios, where HFI+SDD outperforms existing inference-time and fine-tuning baselines in reducing unsafe outputs while maintaining perceptual quality, as measured by CLIP-based evaluations and human judgments. By encoding human perspective directly into soft tokens and leveraging reward-guided inversion, the approach provides a cost-effective, scalable path to safer diffusion models for public use. The work highlights practical considerations for data collection and the importance of aligning model behavior with nuanced human judgments, laying groundwork for more robust, human-centric safety mechanisms in generative AI.
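As a rough illustration of the timestep sampling mentioned above, the following sketch draws training timesteps from a Beta(3, 3) distribution so that most samples fall in the middle of the diffusion trajectory. The function name `sample_timesteps`, the 1000-step schedule, and the floor-based mapping from the unit interval to integer timesteps are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def sample_timesteps(batch_size, num_train_timesteps=1000,
                     alpha=3.0, beta=3.0, seed=None):
    """Draw integer timesteps concentrated around the middle of the schedule.

    Hypothetical helper: the 1000-step schedule and the floor mapping
    are illustrative assumptions, not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    # Beta(3, 3) is symmetric around 0.5 and puts little mass near 0 or 1.
    u = rng.beta(alpha, beta, size=batch_size)
    # Map the unit interval onto discrete timesteps 0 .. num_train_timesteps - 1.
    t = np.floor(u * num_train_timesteps).astype(np.int64)
    return np.clip(t, 0, num_train_timesteps - 1)


timesteps = sample_timesteps(10_000, seed=0)
# Fraction of samples landing in the middle half of the schedule.
middle_fraction = np.mean((timesteps >= 250) & (timesteps < 750))
```

With $\alpha=\beta=3$, roughly four out of five draws land in the middle half of the schedule, whereas uniform sampling would place only half of them there.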
Abstract
This paper addresses the societal concerns arising from large-scale text-to-image diffusion models generating potentially harmful or copyrighted content. Existing models rely heavily on internet-crawled data, wherein problematic concepts persist due to incomplete filtration processes. While previous approaches somewhat alleviate the issue, they often rely on text-specified concepts, introducing challenges in accurately capturing nuanced concepts and aligning model knowledge with human understanding. In response, we propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images. The proposed framework can be built upon existing techniques for the same purpose, enhancing their alignment with human judgment. By doing so, we simplify the training objective with a self-distillation-based technique, providing a strong baseline for concept removal. Our experimental results demonstrate that our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
