Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

Senmao Li; Joost van de Weijer; Taihang Hu; Fahad Shahbaz Khan; Qibin Hou; Yaxing Wang; Jian Yang

TL;DR

The paper tackles the challenge of suppressing undesired content in text-to-image diffusion by manipulating text embeddings rather than fine-tuning generators. It introduces soft-weighted regularization to weaken negative information embedded in [EOT] tokens and a subsequent inference-time embedding optimization that preserves the positive target while further suppressing the negative content through attention-based losses. The approach demonstrates improved suppression performance across generated and real images, generalizes to both Stable Diffusion and DeepFloyd-IF, and remains model-agnostic without requiring paired data. A notable limitation is the runtime cost of the inference-time optimization, which the authors acknowledge and suggest could be reduced with engineering efforts.

Abstract

The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to suppress undesired content that the prompt explicitly asks to omit from the generated image. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as $\textit{soft-weighted regularization}$ and $\textit{inference-time text embedding optimization}$. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second further suppresses the generation of the unwanted content named in the prompt and encourages the generation of the desired content. We evaluate our method quantitatively and qualitatively in extensive experiments, validating its effectiveness. Furthermore, our method generalizes to both pixel-space diffusion models (i.e., DeepFloyd-IF) and latent-space diffusion models (i.e., Stable Diffusion).
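To make the first contribution more concrete, the following is a minimal sketch of regularizing an embedding matrix through its singular values. It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_weighted_regularization`, the random stand-in matrix `chi`, and in particular the weighting `exp(-sigma) * sigma` are assumptions made for this example, since the abstract does not specify the weighting.

```python
import numpy as np


def soft_weighted_regularization(chi: np.ndarray) -> np.ndarray:
    """Shrink the singular values of an embedding matrix and rebuild it.

    chi has shape (num_tokens, dim). In the paper's setting it would stack
    the negative-target embedding and the [EOT] embeddings; here it is just
    an arbitrary matrix.
    """
    # Thin SVD: chi = U @ diag(s) @ Vt, with s sorted in descending order.
    u, s, vt = np.linalg.svd(chi, full_matrices=False)
    # ASSUMED weighting (not given in the abstract): exp(-sigma) * sigma.
    # Large singular values are damped strongly, small ones barely change.
    s_hat = np.exp(-s) * s
    # Recover the regularized matrix U @ diag(s_hat) @ Vt.
    return (u * s_hat) @ vt


# Hypothetical stand-in for real text embeddings (random data, fixed seed).
rng = np.random.default_rng(0)
chi = 3.0 * rng.normal(size=(8, 16))
chi_hat = soft_weighted_regularization(chi)
```

In a real pipeline, the regularized rows would replace the corresponding rows of the text embedding before it is passed to the diffusion model; that wiring depends on the model and is not shown here.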

Paper Structure (15 sections, 6 equations, 33 figures, 6 tables)

Introduction
Related work
Method
Preliminary: Diffusion Model
Analysis of [EOT] embeddings
Text embedding-based Semantic Suppression
Inference-time text embedding optimization
Experiments
Conclusions and Limitations
Appendix: Implementation Details
Appendix: Eq. \ref{eq:ne_eot_recon} in Soft-weighted Regularization
Appendix: Algorithm detail of generated image.
Appendix: Ablation analysis
Appendix: Additional results
Appendix: Additional applications

Figures (33)

Figure 1: Failure cases of Stable Diffusion (SD) and DeepFloyd-IF. Given the prompt "A man without glasses", both SD and DeepFloyd-IF fail to suppress the generation of the negative target "glasses". Our method successfully removes the "glasses". (Right) We use DetScore (see Sec. \ref{sec:experimental_setup}) to detect "glasses" in 1000 generated images. The DetScore of SD with the prompt "A face without glasses" is 0.122. See Appendix \ref{sec:additional_results} for additional examples.
Figure 2: Analysis of [EOT] embeddings. (a) [EOT] embeddings contain significant information, as can be seen when they are zeroed out. (b) When performing WNNM \cite{gu2014weighted}, we find that [EOT] embeddings carry redundant semantic information. (c) Distance matrix between all text embeddings. Note that the [EOT] embeddings contain similar semantic information and have near-zero distance to one another.
Figure 3: Overview of the proposed method. (a) We construct a negative target embedding matrix $\boldsymbol\chi = [\boldsymbol{c}^{NE},\boldsymbol{c}^{EOT}_0, \cdots, \boldsymbol{c}^{EOT}_{N-|\boldsymbol{p}|-2}]$ and compute its SVD, $\boldsymbol\chi=\textbf{U}{\boldsymbol\Sigma}{\textbf{V}}^T$. We apply a soft-weighted regularization to the dominant singular values and then recover the embedding matrix $\hat{\boldsymbol\chi}=\textbf{U}{\hat{\boldsymbol\Sigma}}{\textbf{V}}^T$. (b) We propose inference-time text embedding optimization (ITO), which aligns the attention maps of $\boldsymbol{c}^{PE}$ and $\boldsymbol{\hat{c}}^{PE}$ while widening the gap between the attention maps of $\boldsymbol{c}^{NE}$ and $\boldsymbol{\hat{c}}^{NE}$.
Figure 4: Our algorithm
Figure 5: Effect of resetting top-K or bottom-K singular values to 0. Main singular values correspond to the target information that we expect to be suppressed.
...and 28 more figures
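
The experiment described in the Figure 5 caption (resetting the top-K or bottom-K singular values to 0) can be sketched as follows. This is a hedged illustration only: the helper `zero_singular_values`, its `which` parameter, and the synthetic matrix are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def zero_singular_values(chi: np.ndarray, k: int, which: str = "top") -> np.ndarray:
    """Reset the k largest ("top") or k smallest ("bottom") singular values to 0."""
    # np.linalg.svd returns the singular values in descending order.
    u, s, vt = np.linalg.svd(chi, full_matrices=False)
    s = s.copy()
    if which == "top":
        s[:k] = 0.0
    elif which == "bottom":
        # len(s) - k keeps k == 0 a no-op (s[-0:] would select everything).
        s[len(s) - k:] = 0.0
    else:
        raise ValueError("which must be 'top' or 'bottom'")
    return (u * s) @ vt


# Synthetic matrix: one dominant direction plus small noise (hypothetical data).
rng = np.random.default_rng(0)
dominant = 10.0 * np.outer(rng.normal(size=6), rng.normal(size=8))
chi = dominant + 0.01 * rng.normal(size=(6, 8))

without_top = zero_singular_values(chi, k=1, which="top")
without_bottom = zero_singular_values(chi, k=1, which="bottom")
```

On this synthetic matrix, removing the top singular value discards almost all of the matrix's energy, while removing the bottom one leaves it nearly unchanged, which mirrors the caption's point that the main singular values carry the dominant content.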
