ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning

Dong Han; Salaheldin Mohamed; Yong Li

TL;DR

ShieldDiff tackles the problem of uncontrolled NSFW content generation in diffusion-based T2I models by introducing reinforcement learning with a dual content-safe reward that combines a Nudity detector and a CLIP-based semantic score. The method fine-tunes a pre-trained diffusion model using LoRA adapters and a black-box reward signal, optimizing $J(\theta)=\mathbb{E}_{c,x_0}[r(x_0,c)]$ via policy-gradient with DDPO and importance sampling. The Nudity reward is supplemented by a CLIP-based semantic reward to preserve prompt fidelity, and the approach is designed to be text-agnostic to resist prompt-based attacks. Empirical results show high Nudity Removal Rates (≈97–99%) across multiple datasets, stronger robustness to black-box attacks, and better semantic preservation than several SOTA methods, with extensions to I2I diffusion and face anonymization, supported by the proposed NRLSA metric for evaluating safe alignment.
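The policy-gradient step mentioned above can be illustrated with a minimal sketch of an importance-sampled, clipped surrogate loss in the style of DDPO. This is not the authors' implementation. The function names, the per-batch reward normalization, and the `clip_range` value are assumptions made for illustration; the summary only states that DDPO with importance sampling is used. Log-probabilities are per denoising step, and each trajectory shares one advantage derived from its final-image reward $r(x_0,c)$.

```python
import math
from statistics import mean, pstdev


def normalize_advantages(rewards, eps=1e-8):
    """Turn raw rewards into zero-mean, unit-variance advantages (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ddpo_is_loss(new_logps, old_logps, advantages, clip_range=0.2):
    """Clipped importance-sampling surrogate, averaged over all denoising steps.

    new_logps[i][t]: log-prob of step t of trajectory i under the current policy.
    old_logps[i][t]: log-prob of the same step under the policy that sampled it.
    advantages[i]:   one advantage per trajectory.
    clip_range:      hypothetical value, not taken from the paper.
    """
    terms = []
    for new_traj, old_traj, adv in zip(new_logps, old_logps, advantages):
        for lp_new, lp_old in zip(new_traj, old_traj):
            ratio = math.exp(lp_new - lp_old)  # importance weight
            clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
            # Minimizing this term maximizes the pessimistic (clipped) objective.
            terms.append(-min(ratio * adv, clipped * adv))
    return mean(terms)
```

In an actual fine-tuning loop the log-probabilities would come from the LoRA-adapted denoiser and the loss would be differentiated with an autograd framework; the plain-Python version here only shows the arithmetic.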

Abstract

With the advance of generative AI, text-to-image (T2I) models can generate a wide variety of content. However, the generated content cannot be fully controlled, and there is a risk that a T2I model will generate unsafe images with uncomfortable content. In this work, we focus on eliminating NSFW (not safe for work) content generation from a T2I model while maintaining the high quality of generated images, by fine-tuning the pre-trained diffusion model via reinforcement learning to optimize a content-safe reward function. The proposed method leverages a customized reward function consisting of CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune the nudity content that adheres to the pre-trained model while keeping the corresponding semantic meaning on the safe side. In this way, the T2I model is robust to unsafe adversarial prompts, since unsafe visual representations are mitigated in the latent space. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method in alleviating unsafe content generation while preserving the fidelity of benign images as well as images generated from unsafe prompts. We compare with five existing state-of-the-art (SOTA) methods and achieve competitive performance on sexual content removal and image quality retention. In terms of robustness, our method outperforms its counterparts under a SOTA black-box attack model. Furthermore, our method can serve as a benchmark for anti-NSFW generation with semantically relevant safe alignment.
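As a rough illustration of how a reward combining nudity and CLIP terms might be wired together, the sketch below treats both scorers as black-box callables. The exact functional form and weighting are not given in this summary, so the weighted sum, the `1 - nudity` safety term, and the names `content_safe_reward`, `nudity_score`, `clip_score`, `w_nudity`, and `w_clip` are all assumptions.

```python
from typing import Any, Callable


def content_safe_reward(
    image: Any,
    prompt: str,
    nudity_score: Callable[[Any], float],
    clip_score: Callable[[Any, str], float],
    w_nudity: float = 1.0,
    w_clip: float = 1.0,
) -> float:
    """Hypothetical dual reward: high when the image is safe AND matches the prompt.

    nudity_score(image) is assumed to return a value in [0, 1], higher = more nudity.
    clip_score(image, prompt) is assumed to return an image-text similarity.
    """
    safety = 1.0 - nudity_score(image)    # reward the absence of nudity
    semantic = clip_score(image, prompt)  # reward prompt fidelity
    return w_nudity * safety + w_clip * semantic
```

With stub scorers, `content_safe_reward("img", "a portrait", lambda im: 0.25, lambda im, p: 0.5)` evaluates to `1.25`. Because the scorers are only called, never differentiated, this fits the black-box reward setting described in the TL;DR.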

Paper Structure (18 sections, 4 equations, 10 figures, 4 tables)

Introduction
Background
Diffusion Models
LoRA Fine-Tuning
Reinforcement Learning (RL) in Fine-Tuning
Proposed Method
Overview
Reward
Text-Agnostic Methods
Experiments
Datasets and Implementation Details
Artifacts of Current SOTA Methods
Out-of-distribution (OOD) Performance
Numerical Evaluation
Rethink CLIP Score in Role of Nudity Elimination
...and 3 more sections

Figures (10)

Figure 1: Context preserving of our proposed nudity elimination method. The adversarial prompt used here: "elon musk boudoir photoshoot for Calvin klein". Images displaying nudity are censored by authors.
Figure 2: Reinforcement learning process for nudity elimination.
Figure 3: Context preserving safe reward.
Figure 4: Artifacts of SOTA methods under safe and unsafe prompts.
Figure 5: SafeGen mistakes “safe” prompts as unsafe. Prompt is the same for each column.
...and 5 more figures
