Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models

Kota Sueyoshi, Takashi Matsubara

TL;DR

Diffusion models often misalign with complex text prompts because they neglect the logical relationships between terms. Predicated Diffusion introduces a unified framework that expresses the meaning of a prompt as predicate-logic propositions and treats attention maps as fuzzy predicates, which yields a differentiable loss that steers the diffusion process toward satisfying those propositions. Losses for concurrent existence, one-to-one correspondence, and possession are derived from first-order logic and product fuzzy logic, and the method substantially improves fidelity and object-layout accuracy while preserving image quality across diverse prompts. The approach generalizes to other backbones and can be extended to additional logical statements, offering a principled direction for aligning text semantics with text-to-image diffusion models.

Abstract

Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images. In text-based image generation, however, they often fail to capture the meaning intended by the text. For instance, a specified object may not be generated, an unnecessary object may be generated, or an adjective may alter objects it was not intended to modify. Moreover, we found that relationships indicating possession between objects are often overlooked. While users' intentions in text are diverse, existing methods tend to address only some of them. In this paper, we propose Predicated Diffusion, a unified framework for expressing users' intentions. We consider the root of the above issues to lie in the text encoder, which often focuses only on individual words and neglects the logical relationships between them. The proposed method does not rely solely on the text encoder; instead, it represents the intended meaning of the text as propositions in predicate logic and treats the pixels in the attention maps as fuzzy predicates. This yields a differentiable loss function whose minimization drives the image to fulfill the propositions. In comparisons with several existing methods, Predicated Diffusion generated images that were more faithful to various text prompts, as verified by human evaluators and pretrained image-text models.
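
To make the idea of "attention pixels as fuzzy predicates" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual product fuzzy logic operators (negation as `1 - a`, conjunction as a product) and assumes each attention map is a flat list of values in [0, 1]. The function names, the `eps` stabilizer, and the choice of `1 - p * (1 - q)` for implication are illustrative assumptions.

```python
import math

def exists_truth(attention):
    """Fuzzy truth of 'at least one pixel satisfies the predicate'.

    Assumes product fuzzy logic: NOT a = 1 - a, AND = product,
    so EXISTS x. P(x) = 1 - prod_i (1 - a_i).
    """
    none_satisfy = 1.0
    for a in attention:
        none_satisfy *= (1.0 - a)
    return 1.0 - none_satisfy

def existence_loss(attention, eps=1e-12):
    """Negative log of the truth value; small when some pixel attends strongly.

    eps is a hypothetical stabilizer to avoid log(0).
    """
    return -math.log(exists_truth(attention) + eps)

def implication_truth(p, q):
    """P -> Q read as NOT (P AND NOT Q), i.e. 1 - p * (1 - q) (assumed form)."""
    return 1.0 - p * (1.0 - q)

def forall_implication_loss(attn_p, attn_q, eps=1e-12):
    """Loss for 'for every pixel x, P(x) -> Q(x)'.

    FOR ALL is a product over pixels, so its negative log is a sum.
    """
    return -sum(
        math.log(implication_truth(p, q) + eps)
        for p, q in zip(attn_p, attn_q)
    )

if __name__ == "__main__":
    weak = [0.05, 0.10, 0.02]    # object barely attended anywhere
    strong = [0.05, 0.90, 0.02]  # object clearly attended at one pixel
    print(existence_loss(weak), existence_loss(strong))
```

Because every operation is a product, a difference, or a logarithm, a loss of this shape is differentiable with respect to the attention values, which is what allows it to guide the denoising process. In an actual pipeline the lists would be replaced by cross-attention tensors and the gradient taken with an automatic-differentiation library.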

Paper Structure

This paper contains 43 sections, 8 equations, 13 figures, and 10 tables.

Figures (13)

  • Figure 1: Visualizations of typical challenges in text-based image generation using diffusion models. The proposed Predicated Diffusion can address all of these challenges.
  • Figure 2: The conceptual diagram of the proposed Predicated Diffusion, composed of steps (1)–(6). One can make propositions manually or using a syntactic dependency parser.
  • Figure 3: Results of Experiments \ref{ex:1} for Concurrent Existence and \ref{ex:2} for One-to-One Correspondence.
  • Figure 5: Example results of Experiment \ref{ex:3} for Possession. See also Fig. \ref{fig:experiment3_additional}.
  • Figure 6: Example results of Experiment \ref{ex:4} using prompts in ABC-6K. See also Figs. \ref{fig:experiment4_additional1} and \ref{fig:experiment4_additional2}.
  • ...and 8 more figures