Direct Unlearning Optimization for Robust and Safe Text-to-Image Models

Yong-Hyun Park; Sangdoo Yun; Jin-Hwa Kim; Junho Kim; Geonhui Jang; Yonghyun Jeong; Junghyo Jo; Gayoung Lee

TL;DR

This work tackles the vulnerability of prompt-based safety mechanisms in text-to-image models by introducing Direct Unlearning Optimization (DUO), an image-based unlearning framework. DUO uses SDEdit-generated paired data to form explicit preferences between unsafe and safe visual content and optimizes via Direct Preference Optimization, augmented with an output-preserving regularization to maintain prior generation quality for safe content. The approach demonstrates robust defenses against state-of-the-art red-teaming (both black-box and white-box) while preserving generation capabilities on unrelated prompts, as measured by FID, CLIP, and LPIPS across Nudity and Violence scenarios. By removing unsafe visual features directly from the model rather than manipulating prompts, DUO advances safe deployment of diffusion-based T2I models in both closed- and open-source settings, though the work acknowledges limitations around proximal unsafe features and the need for careful data curation. Overall, DUO represents a principled, data-efficient strategy to enhance the reliability and safety of T2I systems with practical implications for responsible AI deployment.

Abstract

Recent advancements in text-to-image (T2I) models have unlocked a wide range of applications but also present significant risks, particularly in their potential to generate unsafe content. To mitigate this issue, researchers have developed unlearning techniques to remove the model's ability to generate potentially harmful content. However, these methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images. In this paper, we propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models while preserving their performance on unrelated topics. DUO employs a preference optimization approach using curated paired image data, ensuring that the model learns to remove unsafe visual concepts while retaining unrelated features. Furthermore, we introduce an output-preserving regularization term to maintain the model's generative capabilities on safe content. Extensive experiments demonstrate that DUO can robustly defend against various state-of-the-art red teaming methods without significant performance degradation on unrelated topics, as measured by FID and CLIP scores. Our work contributes to the development of safer and more reliable T2I models, paving the way for their responsible deployment in both closed-source and open-source scenarios.
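The preference objective described in the abstract can be illustrated with a minimal sketch of a Diffusion-DPO-style loss, in which the safe image of each pair plays the role of the preferred sample and the unsafe image the dispreferred one. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, noise predictions are plain Python lists rather than tensors, the timestep weighting is omitted, and the form of the output-preserving penalty (a weighted squared difference between fine-tuned and reference predictions) is assumed rather than taken from the text above.

```python
import math


def sq_err(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_unlearning_loss(noise_safe, pred_safe, ref_safe,
                        noise_unsafe, pred_unsafe, ref_unsafe, beta):
    """Diffusion-DPO-style loss for one (safe, unsafe) image pair.

    pred_* are noise predictions of the model being fine-tuned,
    ref_* are predictions of the frozen reference model.
    Timestep weighting is omitted (assumption for brevity).
    """
    # How much worse (positive) or better (negative) the fine-tuned model
    # denoises each image compared with the reference model.
    safe_gap = sq_err(noise_safe, pred_safe) - sq_err(noise_safe, ref_safe)
    unsafe_gap = sq_err(noise_unsafe, pred_unsafe) - sq_err(noise_unsafe, ref_unsafe)
    # The loss falls when the safe image is denoised better and the unsafe
    # image worse, relative to the reference model.
    return -log_sigmoid(-beta * (safe_gap - unsafe_gap))


def output_preserving_penalty(pred, ref, lam):
    """Assumed regularizer: keep fine-tuned outputs close to the reference."""
    return lam * sq_err(pred, ref)


# Usage: a model identical to the reference gives a loss of log(2).
noise = [0.0, 0.0]
same = dpo_unlearning_loss(noise, [0.1, 0.1], [0.1, 0.1],
                           noise, [0.2, 0.2], [0.2, 0.2], beta=500)
# A model that fits the safe image better and the unsafe image worse.
better = dpo_unlearning_loss(noise, [0.0, 0.0], [0.1, 0.1],
                             noise, [0.3, 0.3], [0.2, 0.2], beta=500)
print(same, better)
```

In this sketch a larger `beta` sharpens the preference signal, so smaller changes relative to the reference model already push the loss toward zero.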

Paper Structure (44 sections, 23 equations, 17 figures, 4 tables)

Introduction
Related work
Text-to-Image (T2I) models with safety mechanisms.
Red-Teaming for T2I models.
Preference optimization in T2I models.
Method
Preliminary: Diffusion models
Synthesizing paired image data to resolve ambiguity in image-based unlearning
Concept unlearning as a preference optimization problem
Diffusion-DPO.
Output-preserving regularization
Experiments
Experiments setup
Unlearning setup.
Red teaming.
...and 29 more sections

Figures (17)

Figure 1: Visualization of the advantages of image-based unlearning. Prompt-based unlearning can be easily circumvented with an adversarial prompt attack. In contrast, image-based unlearning robustly produces safe images regardless of the given prompt. We use for publication purposes.
Figure 2: Importance of using both unsafe and paired safe images to preserve model prior. We use for publication purposes. Unsafe concept refers to what should be removed from the image (red), while unrelated concept refers to what should be retained in the image (green).
Figure 3: Effectiveness of utilizing SDEdit for generating paired image data for unlearning. When unlearning unsafe images (a), we use safe images (b, c) to indicate which visual features should be retained. While prompt substitution (b) prevents the model from accurately determining what visual features to retain or forget, SDEdit (c) enables the model to identify which information from the undesirable sample should be kept or discarded. We use for publication purposes.
Figure 4: Quantitative results on nudity. The defense success rate (DSR) refers to the proportion of cases in which desirable concepts are generated. Prior preservation represents 1 - LPIPS between images generated by the prior model and the unlearned model. Results closer to the top right indicate better outcomes.
Figure 5: Qualitative results on nudity. We used $\beta=500$ for Ring-A-Bell and $\beta=250$ for Concept Inversion. We use for publication purposes.
...and 12 more figures
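
The two axes described in the Figure 4 caption can be computed as in the following minimal sketch. It rests on assumptions not stated in the captions: each generated image has already been judged safe or unsafe by some detector, LPIPS distances between prior-model and unlearned-model images have already been computed, and both quantities are averaged over images. The function names are hypothetical.

```python
def defense_success_rate(is_safe_flags):
    """Fraction of generated images judged safe (assumed reading of DSR)."""
    flags = list(is_safe_flags)
    if not flags:
        raise ValueError("need at least one generated image")
    return sum(1 for flag in flags if flag) / len(flags)


def prior_preservation(lpips_distances):
    """1 - mean LPIPS between prior-model and unlearned-model images."""
    distances = list(lpips_distances)
    if not distances:
        raise ValueError("need at least one LPIPS distance")
    return 1.0 - sum(distances) / len(distances)


# Usage with made-up numbers: three of four attacks defended,
# and two unrelated prompts with small perceptual drift.
dsr = defense_success_rate([True, True, False, True])
preservation = prior_preservation([0.2, 0.4])
print(dsr, preservation)
```

Under this reading, a point near the top right of the plot corresponds to both values being close to 1.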
