Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models

Chen Wu, Fernando De la Torre

TL;DR

This paper tackles the challenge of fine-grained, disentangled control in text-to-image diffusion models. It introduces Contrastive Guidance, a two-prompt conditioning framework that yields a generative classifier-based score, with the gradient $\nabla_x \log p_t(x \mid c)$ expressed as $\nabla_x \log p_t(x) + \lambda_t \left( \nabla_x \log p_t^{+}(x) - \nabla_x \log p_t^{-}(x) \right) \approx s(x,t) + \lambda_t \left( s_\theta(x,t,y^{+}) - s_\theta(x,t,y^{-}) \right)$, enabling targeted manipulation while suppressing unintended factors. The method is shown to be effective across three broad applications: guiding domain-specific diffusion experts, enabling continuous, rig-like control from text, and strengthening zero-shot image editing. Empirical results demonstrate improved disentanglement and performance over classifier-free guidance, with detailed analyses and practical guidance for implementation and efficiency through adaptive temperature and prompt-pair design.
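
The score combination above can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names `contrastive_guidance_score` and `gaussian_score` are hypothetical, and analytic Gaussian scores stand in for the outputs of a trained network $s_\theta$. The positive and baseline distributions are assumed to differ only along the first axis, mimicking two prompts $y^{+}$ and $y^{-}$ that differ in a single factor.

```python
import numpy as np


def contrastive_guidance_score(base_score, pos_score, neg_score, lam):
    """Combine three score estimates as s + lam * (s_pos - s_neg)."""
    return base_score + lam * (pos_score - neg_score)


def gaussian_score(x, mean, var=1.0):
    """Analytic score of an isotropic Gaussian N(mean, var * I).

    Used here only as a stand-in for a learned score network (assumption).
    """
    return -(x - mean) / var


# A toy 2-D "image". Axis 0 plays the role of the intended factor,
# axis 1 plays the role of every other factor.
x = np.array([0.5, -1.0])

# Score of the model being guided (toy: standard normal).
s_base = gaussian_score(x, mean=np.zeros(2))

# Scores under the positive and baseline prompts (toy: means that
# differ only along axis 0).
s_pos = gaussian_score(x, mean=np.array([1.0, 0.0]))
s_neg = gaussian_score(x, mean=np.array([-1.0, 0.0]))

# lam plays the role of the guidance weight lambda_t at one timestep.
guided = contrastive_guidance_score(s_base, s_pos, s_neg, lam=2.0)
print(guided)
```

In this toy setting the difference `s_pos - s_neg` is zero along the second axis, so the guided score changes only along the first axis. That is the intuition behind using a minimally different baseline prompt: whatever the two prompts share cancels in the difference.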

Abstract

Text-to-image diffusion models have achieved remarkable performance in image synthesis, but the text interface does not always provide fine-grained control over certain image factors. For instance, changing a single token in the text can have unintended effects on the image. This paper shows that a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models. The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens: the positive prompt describes the image to be synthesized, and the baseline prompt serves as a "baseline" that disentangles other factors. Contrastive Guidance is a general method whose benefits we illustrate in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.

Paper Structure (39 sections, 28 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: This paper explores disentangled control for text-to-image diffusion models. We show that using a pair of contrastive prompts can (1) guide domain experts (e.g., trained on an object class) with text, (2) enable rig-like control for text-to-image diffusion models (similar to steering a StyleGAN), and (3) improve text-to-image diffusion-based image editing methods.
  • Figure 2: Overview of Contrastive Guidance. Given a model to be guided, Contrastive Guidance uses two prompts with minimal differences as the guidance signal, which helps disentangle the intended image factors. $^{*}$The guided model can be a domain expert, the text-to-image model itself, or an image editing method (Section \ref{subsec:application-method}).
  • Figure 3: Contrastive Guidance improves disentanglement (e.g., attributes, background, foreground, and objects). Within each row, we fixed all random variables during the sampling process. The gray-scale images visualize the norm of the pixel-wise distance between the two images before and after the guidance.
  • Figure 4: Contrastive Guidance for continuous, rig-like controls, which cannot be described accurately by language. Results show that Contrastive Guidance can guide text-to-image diffusion models to generate images with continuously changing variations. The positive and baseline prompts are provided in Table \ref{tab:rig-prompt-examples}. Images are resized to $256 \times 256$; please zoom in to see the details.
  • Figure 5: Contrastive Guidance mitigates contradictions. We focused on failure cases of the text-to-image diffusion model, such as understanding negation (without eyeglasses, no umbrellas, without beard), relation (in water), and one unexpected contradiction (colored photo). The prompts used in this experiment are provided in Table \ref{tab:hard-prompt-examples}.
  • ...and 4 more figures