Relation Rectification in Diffusion Model

Yinwei Wu, Xingyi Yang, Xinchao Wang

TL;DR

This work tackles the challenge of accurate relational reasoning in diffusion-based text-to-image synthesis by identifying that the end-of-text embedding $V_{eot}$ largely governs relationship semantics and is insufficiently discriminative for directional prompts. It introduces RRNet, a lightweight Heterogeneous Graph Convolutional Network (HGCN) that models directed heterogeneous graphs derived from object–relation prompts and outputs an adjustment vector $h_{\Delta EOT}^{(L)}$ that is added to $V_{eot}$, yielding a rectified embedding $V_{eot}^* = V_{eot} + \lambda h_{\Delta EOT}^{(L)}$ that conditions a frozen diffusion model. The model is trained with a positive denoising objective and a negative loss to distinguish object-swapped prompts (OSPs) while disentangling object semantics, and is evaluated on the Relation Rectification Benchmark, showing improved relationship accuracy and interpretability with a favorable speed–quality trade-off. The approach generalizes to unseen objects and maintains robust performance, offering a practical pathway to more reliable relation-aware image synthesis in real-world applications.
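The rectification step $V_{eot}^* = V_{eot} + \lambda h_{\Delta EOT}^{(L)}$ is an element-wise scaled addition. A minimal sketch in plain Python follows; the function name `rectify_eot` and the toy 4-dimensional vectors are illustrative assumptions, not values or code from the paper.

```python
from typing import List


def rectify_eot(v_eot: List[float], h_delta: List[float], lam: float) -> List[float]:
    """Return V_eot* = V_eot + lam * h_delta, computed element-wise.

    `rectify_eot` is a hypothetical helper name used only for this sketch.
    """
    if len(v_eot) != len(h_delta):
        raise ValueError("embedding and adjustment vector must have the same length")
    return [v + lam * h for v, h in zip(v_eot, h_delta)]


# Toy vectors, purely illustrative (real text embeddings are much larger).
v_eot = [0.5, -1.0, 0.25, 2.0]
h_delta = [1.0, 0.5, -0.5, 0.0]

unchanged = rectify_eot(v_eot, h_delta, 0.0)  # lam = 0 leaves V_eot as it was
rectified = rectify_eot(v_eot, h_delta, 0.5)  # larger lam applies more of the adjustment
```

With $\lambda = 0$ the original embedding is recovered, so $\lambda$ acts as a dial on how strongly the adjustment is applied.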

Abstract

Despite their exceptional generative abilities, large text-to-image diffusion models, much like skilled but careless artists, often struggle with accurately depicting visual relationships between objects. This issue, as we uncover through careful analysis, arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. To resolve this, we introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this, we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically, we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders, supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space. Crucially, our method retains the parameters of the text encoder and diffusion model, preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data, demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: https://wuyinwei-hah.github.io/rrnet.github.io/.
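To make the training pair concrete, the sketch below builds two prompts with the same relation word but reversed object order, plus a small directed graph for each. The edge layout (subject to relation, relation to object), the helper names, and the example triple are assumptions made for illustration; the abstract does not specify the exact graph construction.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def swap_objects(triple: Triple) -> Triple:
    """Reverse the object order while keeping the relation word."""
    subject, relation, obj = triple
    return (obj, relation, subject)


def to_prompt(triple: Triple) -> str:
    """Join a triple into a plain text prompt."""
    subject, relation, obj = triple
    return f"{subject} {relation} {obj}"


def build_directed_graph(triple: Triple) -> Dict[str, object]:
    """Typed nodes plus directed edges subject -> relation -> object.

    This edge layout is an assumption for this sketch, not the paper's definition.
    """
    subject, relation, obj = triple
    node_types: Dict[str, str] = {subject: "object", relation: "relation", obj: "object"}
    edges: List[Tuple[str, str]] = [(subject, relation), (relation, obj)]
    return {"node_types": node_types, "edges": edges}


original = ("a book", "is placed on", "a bowl")  # hypothetical example triple
swapped = swap_objects(original)

graph_original = build_directed_graph(original)
graph_swapped = build_directed_graph(swapped)
```

The two graphs share the same nodes and differ only in edge direction, which is the information a direction-aware graph network can use to tell the two prompts apart.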

Paper Structure (24 sections, 8 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Effect of masking out text embeddings corresponding to different tokens. We found that masking out the embedding of $[EOT]$ dramatically destroys the semantics of the generated images, including relationships, whereas masking out the embeddings corresponding to words results in only marginal changes.
  • Figure 2: Our RRNet Architecture. Given OSPs and their exemplar images, RRNet learns to produce adjustment vectors $h_{\Delta EOT}^{(L)}$, which are then added to the original $V_{eot}$ of the prompts. The rectified embeddings are then used as the condition to guide the generation process of a frozen SD. The upper left part is the heterogeneous graph RRNet uses to model the relation direction. After optimization with the negative loss and the denoising loss, the SD is able to generate images with the correct relation direction.
  • Figure 3: Qualitative Results. By increasing the weight $\lambda$ of the adjustment vector, we show the process of correcting relation direction in the generated images that were originally incorrect.
  • Figure 4: Qualitative Comparisons. Our method outperforms the baselines in terms of relationship generation.
  • Figure 5: Example of Generalization. By switching the objects in the prompts, RRNet can still generate correct relationships.
  • ...and 7 more figures
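
The masking probe described in the Figure 1 caption can be sketched as replacing selected rows of a token-embedding sequence before it conditions the generator. Treating "masking out" as zeroing the rows is an assumption of this sketch, as are the helper name and the toy sequence; the caption does not state how the masking is implemented.

```python
from typing import List, Sequence


def mask_token_embeddings(
    embeddings: Sequence[Sequence[float]], positions: Sequence[int]
) -> List[List[float]]:
    """Return a copy of `embeddings` with the rows at `positions` set to zeros.

    Zeroing is an assumed masking strategy, chosen only for illustration.
    """
    masked = [list(row) for row in embeddings]  # copy so the input stays intact
    for pos in positions:
        masked[pos] = [0.0] * len(masked[pos])
    return masked


# Toy sequence: three word-token rows followed by one [EOT] row (hypothetical values).
sequence = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.9, 0.8],  # stand-in for the [EOT] embedding
]
eot_index = len(sequence) - 1

eot_masked = mask_token_embeddings(sequence, [eot_index])   # probe: drop [EOT]
words_masked = mask_token_embeddings(sequence, [0, 1, 2])   # probe: drop word tokens
```

Feeding each masked sequence to the generator and comparing the outputs is the kind of comparison the caption reports.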