LogicalDefender: Discovering, Extracting, and Utilizing Common-Sense Knowledge

Yuhe Liu, Mengxue Kang, Zengchang Qin, Xiangxiang Chu

TL;DR

This work addresses the gap in text-to-image systems' ability to respect deep-seated logical relations in scene composition. It introduces LogicalDefender, a framework that learns a dedicated logical embedding by fusing human-summarized common-sense knowledge with illustrative images, and enhances learning through a negative-parallel training path that suppresses non-logical features. Using Latent Diffusion Models, initialization tokens derived from LLM descriptions, and carefully designed prompts, the approach achieves improved logical coherence (e.g., correct attachment of fruit stems to trees) while preserving fidelity, with demonstrated generalization to unseen fruits. The method offers a practical, low-cost path to integrating structured commonsense reasoning into image synthesis, with broad potential for advancing controllable and explainable generative systems.

Abstract

Large text-to-image models have achieved astonishing performance in synthesizing diverse and high-quality images guided by texts. With detail-oriented conditioning control, even finer-grained spatial control can be achieved. However, some generated images still appear unreasonable, even with plentiful object features and a harmonious style. In this paper, we delve into the underlying causes and find that deep-level logical information, serving as common-sense knowledge, plays a significant role in understanding and processing images. Nonetheless, almost all models have neglected the importance of logical relations in images, resulting in poor performance in this aspect. Following this observation, we propose LogicalDefender, which combines images with the logical knowledge already summarized by humans in text. This encourages models to learn logical knowledge faster and better, and concurrently, extracts the widely applicable logical knowledge from both images and human knowledge. Experiments show that our model has achieved better logical performance, and the extracted logical knowledge can be effectively applied to other scenarios.

Paper Structure

This paper contains 19 sections, 10 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: The images illustrate the logical performance on various types of objects. The first row shows the initial results, obtained by direct inference from the SD1.5 LDM. The second row shows the rules' results, obtained by direct inference from SD1.5 with prompt adjustment. The third row shows results from our method, LogicalDefender. The results demonstrate that our method significantly outperforms the other two in terms of logical accuracy. For apples and pears, the stem in our version is attached to the tree, unlike the first two images, where it points outward. For cherries, our cherries are paired together and attached to the tree by a branch, not simply hanging on the tree as in the first two images. For lemons, our lemons are connected to the tree via a stem, unlike the first two images, where they are suspended in mid-air without connection.
  • Figure 2: Framework of the Negative-Parallel training path. Note that the only trainable parameter in $\mathbf{\theta}$ is our logical embedding, which is under the placeholder string $\left[ V\right]$.
  • Figure 3: Generalization. The figure displays the effective generalization of logical information. Trained on images of apples, cherries, lemons, and pears, our model successfully extrapolates this learning to images of durians, mangoes, peaches, and oranges. All of these fruits are connected to the tree via their stems. In contrast, inference with the original SD1.5 model produced images in which the fruits appeared to be floating, with no connection to the tree.
  • Figure 4: Num of classes. "2 classes * 3 imgs" represents learning from both lemons and apples, while "1 class * 6 imgs" involves learning solely from apple images. The results suggest that increased category diversity improves logical learning and shape preservation. However, while this method enhances apple image outcomes, it lacks effective generalization to other fruits.
  • Figure 5: Num of images. Learning from three images of each of four fruit types ("4 classes * 3 imgs") enhances our ability to understand logical information, as evidenced by the learned connection between the object and branches in the lemon and cherry images. This learning outcome is not achieved when only one image per fruit type is used ("4 class * 1 img").
  • ...and 7 more figures
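
The Figure 2 caption states that the logical embedding under the placeholder string [V] is the only trainable parameter. The sketch below illustrates that idea in the simplest possible form: a toy token-embedding table in which every row is frozen except the [V] row. It is not the authors' implementation. The averaging "text encoder", the squared-error loss, the target vector, and the names `prompt_embedding` and `train_placeholder_only` are all assumptions made for illustration; the paper itself trains through a Latent Diffusion Model with a negative-parallel path, which is not reproduced here.

```python
import copy

PLACEHOLDER = "[V]"  # placeholder string named in the Figure 2 caption


def prompt_embedding(table, tokens):
    """Toy stand-in for a text encoder (assumption): average the token vectors."""
    dim = len(next(iter(table.values())))
    return [sum(table[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def train_placeholder_only(table, tokens, target, lr=2.5, steps=100):
    """Gradient descent on a toy squared-error loss, updating only the [V] row.

    All other rows of the embedding table stay frozen, mirroring the idea
    that the logical embedding is the only trainable parameter.
    """
    table = copy.deepcopy(table)  # leave the caller's table untouched
    n = len(tokens)
    count = tokens.count(PLACEHOLDER)
    for _ in range(steps):
        pred = prompt_embedding(table, tokens)
        # Gradient of sum_i (pred_i - target_i)^2 with respect to the [V] row,
        # given that pred is the mean of the prompt's token vectors.
        grad = [2.0 * (p - t) * count / n for p, t in zip(pred, target)]
        table[PLACEHOLDER] = [w - lr * g for w, g in zip(table[PLACEHOLDER], grad)]
    return table


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings, chosen only for the demo.
    table = {
        "a": [0.1, 0.2],
        "photo": [0.3, -0.1],
        "of": [0.0, 0.5],
        "apple": [0.7, 0.4],
        PLACEHOLDER: [0.0, 0.0],
    }
    tokens = ["a", "photo", "of", PLACEHOLDER, "apple"]
    trained = train_placeholder_only(table, tokens, target=[1.0, -1.0])
    print("learned [V] row:", trained[PLACEHOLDER])
    print("frozen rows unchanged:",
          all(trained[t] == table[t] for t in table if t != PLACEHOLDER))
```

Restricting the update to a single row is what keeps this kind of approach cheap: the rest of the model, here the other rows of the table, is never modified, so the learned vector can in principle be reused in prompts about other objects.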