Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models
Chen Wu, Fernando De la Torre
TL;DR
This paper tackles the challenge of fine-grained, disentangled control in text-to-image diffusion models. It introduces Contrastive Guidance, a two-prompt conditioning framework that yields a score based on a generative classifier. The score $\nabla_x \log p_t(x \mid c)$ is expressed as $\nabla_x \log p_t(x) + \lambda_t \left( \nabla_x \log p_t^{+}(x) - \nabla_x \log p_t^{-}(x) \right) \approx s(x,t) + \lambda_t \left( s_\theta(x,t,y^{+}) - s_\theta(x,t,y^{-}) \right)$, which enables targeted manipulation while suppressing unintended factors. The method is shown to be effective across three broad applications: guiding domain-specific diffusion experts, enabling continuous, rig-like control from text, and strengthening zero-shot image editing. Empirical results show improved disentanglement and performance over classifier-free guidance, and the paper offers detailed analyses and practical guidance on implementation and efficiency through adaptive temperature and prompt-pair design.
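The right-hand side of the guidance rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `contrastive_guidance_score` and its argument names are hypothetical, the three score estimates are assumed to be already computed as arrays of the same shape, and $\lambda_t$ is passed in as a plain scalar `lam` (the adaptive temperature schedule mentioned in the summary is not reproduced here).

```python
import numpy as np


def contrastive_guidance_score(score_base, score_pos, score_neg, lam):
    """Combine three score estimates as s(x,t) + lam * (s(x,t,y+) - s(x,t,y-)).

    All names here are hypothetical. `score_base` stands in for s(x,t),
    `score_pos` and `score_neg` for the scores conditioned on the positive
    and baseline prompts, and `lam` for the guidance weight lambda_t.
    """
    score_base = np.asarray(score_base, dtype=float)
    score_pos = np.asarray(score_pos, dtype=float)
    score_neg = np.asarray(score_neg, dtype=float)
    if not (score_base.shape == score_pos.shape == score_neg.shape):
        raise ValueError("all score arrays must share one shape")
    # Only the difference between the two prompt-conditioned scores is
    # amplified; whatever the two prompts agree on cancels out.
    return score_base + lam * (score_pos - score_neg)


# Toy stand-in values; a real model would produce image-shaped arrays.
base = np.array([0.5, -1.0])
pos = np.array([1.0, 2.0])
neg = np.array([0.0, 2.0])
guided = contrastive_guidance_score(base, pos, neg, lam=2.0)
```

In the toy values the two conditioned scores agree on the second coordinate, so that coordinate of `guided` equals the base score; only the first coordinate, where the prompts disagree, is shifted.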
Abstract
Text-to-image diffusion models have achieved remarkable performance in image synthesis, but the text interface does not always provide fine-grained control over certain image factors. For instance, changing a single token in the text can have unintended effects on the image. This paper shows that a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models. The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens: the positive prompt describes the image to be synthesized, and the baseline prompt serves as a "baseline" that disentangles other factors. Contrastive Guidance is a general method, and we illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
