SteerDiff: Steering towards Safe Text-to-Image Diffusion Models

Hongxiang Zhang; Yifeng He; Hao Chen

TL;DR

SteerDiff introduces a lightweight embedding-space adaptor that intervenes before diffusion to steer unsafe prompts toward safe outputs without modifying diffusion model weights. By combining an Inappropriate Concepts Identifier with a learnable linear transformation, SteerDiff detects unsafe spans in prompt embeddings and projects them toward safety using the forward rule $e_{steered} = \epsilon \cdot W \cdot e_{unsafe} + (1 - \epsilon) \cdot e_{unsafe}$, trained to minimize $\|e_{safe} - W \cdot e_{unsafe}\|^2$. Trained on safety-focused corpora assembled with CoPro, Midjourney blacklists, and LLM-generated pairs, SteerDiff achieves state-of-the-art robustness against red-teaming while preserving image fidelity and text alignment on benchmarks like I2P and MS-COCO FID-30K. The approach also demonstrates versatility in artist-style removal tasks, suggesting practical, scalable deployment for safe text-conditioned image generation without costly model retraining or extensive data curation.
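The forward rule and training objective above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `fit_steering_matrix` and `steer`, the toy dimensions, and the synthetic embeddings are all assumptions made for illustration. The closed-form least-squares fit stands in for whatever optimizer the paper uses to learn $W$.

```python
import numpy as np

def fit_steering_matrix(e_unsafe, e_safe):
    """Fit W minimizing sum_i ||e_safe_i - W e_unsafe_i||^2.

    Rows of e_unsafe / e_safe are paired embeddings (hypothetical data).
    Closed-form least squares is an assumption; the source only says W is learnable.
    """
    # Solve e_unsafe @ W.T ~= e_safe in the least-squares sense.
    w_t, *_ = np.linalg.lstsq(e_unsafe, e_safe, rcond=None)
    return w_t.T

def steer(e, W, eps):
    """Forward rule: e_steered = eps * W e + (1 - eps) * e."""
    return eps * (W @ e) + (1.0 - eps) * e

# Toy data: 64 paired embeddings of dimension 8 (illustrative sizes only).
rng = np.random.default_rng(0)
d, n = 8, 64
W_true = rng.normal(size=(d, d))
E_unsafe = rng.normal(size=(n, d))
E_safe = E_unsafe @ W_true.T      # synthetic "safe" counterparts

W = fit_steering_matrix(E_unsafe, E_safe)
e = E_unsafe[0]
e_half = steer(e, W, 0.5)         # halfway between original and fully steered
```

In this sketch, $\epsilon = 0$ leaves the embedding unchanged and $\epsilon = 1$ applies the full linear map, so $\epsilon$ acts as a steering strength.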

Abstract

Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent research on red-teaming attacks further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.

Paper Structure (42 sections, 4 equations, 9 figures, 6 tables, 1 algorithm)

Introduction
Methodology
Training Data Collection
Identifier Dataset
Steer Model Dataset
Inappropriate Concepts Identifier
Sliding Identification
Steering Toward Safe Content
Inference
Inference
Experiment
Benchmarks and Metrics
Inappropriate Image Prompts (I2P)
MS-COCO FID-30K
Artist Style Removal
...and 27 more sections

Figures (9)

Figure 1: Overview of SteerDiff, a safety method designed to identify and steer inappropriate concepts toward producing safe images. The input prompt is first embedded by a text encoder. The identifier then checks if the prompt contains any inappropriate concepts $c' \in C_{\text{unsafe}}$. If detected, a linear transformation is applied to the prompt’s embedding to steer it toward safer content. This transformation adjusts the embedding space while preserving the semantics of the original prompt. Once transformed, the modified embedding is passed through the diffusion model to generate safe images.
Figure 2: Overview of SteerDiff (left). SteerDiff learns to distinguish safe and unsafe phrases (right).
Figure 3: SteerDiff successfully removes the targeted concepts "Van Gogh" and "Kelly McKernan" while preserving unrelated concepts such as "Pablo Picasso". The first row displays the original images generated by SD-v1.4, while the second row depicts steered samples generated from the same prompt.
Figure 4: SteerDiff successfully removes the targeted concepts "Van Gogh" and "Kelly McKernan" while preserving unrelated concepts such as "Pablo Picasso". The first row displays the original images generated by SD-v1.4, while the second row depicts steered samples generated from the same prompt.
Figure 5: Attack success rate comparison of SteerDiff and SLD MAX across different categories (lower is better, indicating stronger defense).
...and 4 more figures
