Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee
TL;DR
This work addresses the vulnerability of prompt-based safety mechanisms in text-to-image (T2I) models by introducing Direct Unlearning Optimization (DUO), an image-based unlearning framework. DUO uses SDEdit to generate paired unsafe and safe images, treats each pair as an explicit preference for the safe image over the unsafe one, and optimizes the model with Direct Preference Optimization. An output-preserving regularization term keeps generation quality on safe content close to that of the original model. In experiments on Nudity and Violence scenarios, the approach defends against state-of-the-art black-box and white-box red-teaming attacks while preserving generation on unrelated prompts, as measured by FID, CLIP score, and LPIPS. Because DUO removes unsafe visual features from the model itself rather than manipulating prompts, it supports safer deployment of diffusion-based T2I models in both closed-source and open-source settings. The work acknowledges limitations around proximal unsafe features and the need for careful data curation.
Abstract
Recent advancements in text-to-image (T2I) models have unlocked a wide range of applications but also present significant risks, particularly in their potential to generate unsafe content. To mitigate this issue, researchers have developed unlearning techniques to remove the model's ability to generate potentially harmful content. However, these methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images. In this paper, we propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models while preserving their performance on unrelated topics. DUO employs a preference optimization approach using curated paired image data, ensuring that the model learns to remove unsafe visual concepts while retaining unrelated features. Furthermore, we introduce an output-preserving regularization term to maintain the model's generative capabilities on safe content. Extensive experiments demonstrate that DUO can robustly defend against various state-of-the-art red teaming methods without significant performance degradation on unrelated topics, as measured by FID and CLIP scores. Our work contributes to the development of safer and more reliable T2I models, paving the way for their responsible deployment in both closed-source and open-source scenarios.
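To make the objective described above more concrete, the following is a minimal sketch of a per-sample loss that combines a preference term over a paired safe/unsafe image with an output-preserving regularizer. It is not the authors' implementation. The preference term follows the general form of a diffusion-style Direct Preference Optimization loss, written in terms of noise-prediction errors of the fine-tuned model and a frozen reference model. The form of the regularizer (a squared difference between the two models' predictions on a separate input), the function name `preference_unlearning_loss`, and the parameters `beta` and `lam` are assumptions made for illustration only.

```python
import math


def sq_err(pred, target):
    """Squared L2 distance between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_unlearning_loss(
    eps_safe, eps_unsafe,      # noise added to the safe / unsafe image
    pred_safe, pred_unsafe,    # fine-tuned model's noise predictions
    ref_safe, ref_unsafe,      # frozen reference model's noise predictions
    pred_prior, ref_prior,     # both models' predictions on a separate input
    beta=1.0,                  # hypothetical preference strength
    lam=1.0,                   # hypothetical regularization weight
):
    # How much better (negative) or worse (positive) the fine-tuned model
    # denoises each image compared with the frozen reference model.
    safe_gap = sq_err(pred_safe, eps_safe) - sq_err(ref_safe, eps_safe)
    unsafe_gap = sq_err(pred_unsafe, eps_unsafe) - sq_err(ref_unsafe, eps_unsafe)

    # Preference term: small when the model fits the safe image better and
    # the unsafe image worse than the reference does.
    preference = -log_sigmoid(-beta * (safe_gap - unsafe_gap))

    # Assumed output-preserving term: penalize drift from the reference
    # model's predictions on the separate input.
    regularizer = lam * sq_err(pred_prior, ref_prior)
    return preference + regularizer


if __name__ == "__main__":
    # Toy 2-dimensional "noise" vectors, purely illustrative.
    loss = preference_unlearning_loss(
        eps_safe=[1.0, 0.0], eps_unsafe=[0.0, 1.0],
        pred_safe=[1.0, 0.0], pred_unsafe=[0.0, 0.0],
        ref_safe=[0.0, 0.0], ref_unsafe=[0.0, 1.0],
        pred_prior=[0.0, 0.0], ref_prior=[0.0, 0.0],
    )
    print(loss)
```

In this sketch, a model identical to the reference yields a preference term of log 2 and a zero regularizer. The loss falls below that value only when the fine-tuned model reconstructs the safe image better, or the unsafe image worse, than the reference does. In practice the vectors would be tensors produced by a diffusion model's noise predictor rather than Python lists.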
