SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Guibao Shen; Luozhou Wang; Jiantao Lin; Wenhang Ge; Chaozhe Zhang; Xin Tao; Yuan Zhang; Pengfei Wan; Zhongyuan Wang; Guangyong Chen; Yijun Li; Ying-Cong Chen

TL;DR

This work tackles the problem of false contextualization (relation leakage) in text-to-image diffusion caused by sequential text encoders. It introduces the Scene Graph Adapter (SG-Adapter), a transformer-based refinement placed after the CLIP encoder that uses scene graph triplets and a triplet-token attention mask $M^{\text{sg}}$ to align word embeddings with the correct subject–relation–object structures. A clean, multi-relational MultiRels dataset is proposed, along with three GPT-4V–derived metrics (SG-IoU, Entity-IoU, Relation-IoU) to measure image–scene-graph correspondence. Experimental results show SG-Adapter improves relation generation and correspondence while maintaining image quality, outperforming SG-to-image and baseline text-to-image methods. The approach enables more accurate control of complex relationships in diffusion-based generation and highlights the importance of high-quality, relation-rich data for multi-relational scene understanding.
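To make the idea of a triplet-token mask concrete, here is a minimal sketch of how a binary token-to-triplet matrix could be built from a caption and its scene graph triplets. This is an illustration only, not the authors' implementation: the function name `token_triplet_mask`, the whitespace tokenization, and the word-matching rule are all assumptions made here, and a real pipeline would work on CLIP token indices rather than raw words.

```python
from typing import List, Tuple

# A scene graph triplet: (subject, relation, object).
Triplet = Tuple[str, str, str]


def token_triplet_mask(tokens: List[str], triplets: List[Triplet]) -> List[List[int]]:
    """Return mask[i][k] = 1 if token i occurs in triplet k, else 0.

    Hypothetical helper: matching is done by exact word equality, which is
    an assumption for illustration and breaks when a word repeats across
    triplets.
    """
    mask = []
    for tok in tokens:
        row = []
        for subj, rel, obj in triplets:
            words = set(subj.split()) | set(rel.split()) | set(obj.split())
            row.append(1 if tok in words else 0)
        mask.append(row)
    return mask


tokens = "a man playing guitar and a woman holding umbrella".split()
triplets = [("man", "playing", "guitar"), ("woman", "holding", "umbrella")]
mask = token_triplet_mask(tokens, triplets)

for tok, row in zip(tokens, mask):
    print(f"{tok:10s} {row}")
```

In this sketch "playing" and "guitar" are linked only to the first triplet, so a mask of this shape would stop them from attending to the triplet containing "woman". Tokens such as "a" and "and" belong to no triplet and get an all-zero row, which an attention layer would need to handle explicitly.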

Abstract

Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), which leverages the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit, non-fully connected graph representation greatly improves on the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets such as Visual Genome, we have manually curated MultiRels, a highly clean, multi-relational dataset of paired scene graphs and images. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.
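The three metrics are named SG-IoU, Entity-IoU, and Relation-IoU, but this summary does not give their formulas. One plausible reading, assumed here purely for illustration, is a set intersection-over-union between the reference scene graph and a scene graph extracted from the generated image (for example by GPT-4V). The sketch below uses a hand-written "predicted" graph in place of any real model output, and the helper names are hypothetical.

```python
from typing import Iterable, Hashable, List, Tuple

Triplet = Tuple[str, str, str]


def iou(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Set intersection-over-union; defined as 1.0 when both sets are empty."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def entities(graph: List[Triplet]) -> set:
    # Subjects and objects of every triplet.
    return {t[0] for t in graph} | {t[2] for t in graph}


def relations(graph: List[Triplet]) -> set:
    return {t[1] for t in graph}


# Reference scene graph from the prompt.
reference = [("man", "playing", "guitar"), ("woman", "holding", "umbrella")]
# Hypothetical graph read off a generated image in which the relation
# "playing guitar" leaked onto the woman (made-up data, not a real result).
predicted = [("woman", "playing", "guitar"), ("woman", "holding", "umbrella")]

sg_iou = iou(reference, predicted)                            # whole triplets
entity_iou = iou(entities(reference), entities(predicted))    # entities only
relation_iou = iou(relations(reference), relations(predicted))  # relations only

print(sg_iou, entity_iou, relation_iou)
```

Under these assumed definitions, the leaked relation leaves the relation-level score untouched, lowers the entity-level score because the man is missing, and lowers the triplet-level score most, which suggests why a triplet-level measure is useful alongside the other two.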

Paper Structure (19 sections, 8 equations, 12 figures, 3 tables)


Introduction
Related Work
Proposed Method
Discussions of Causal Attention
Scene Graph Guided Generation
Experiments
Dataset Configuration
Baseline Methods
Qualitative Evaluation
Quantitative Evaluation
Ablation Study
SG-to-Image Generation Evaluation
Conclusion and Discussion
Appendix
Initial Experiment with Scene Graph Attention Mask
...and 4 more sections

Figures (12)

Figure 1: Overcoming Contextualization Limits in Image Generation with Scene Graph. The left section highlights the limitations of text embeddings in sequential text processing, showcasing how relations like "playing guitar" may erroneously apply to the "woman". The right section illustrates the improvements of using a Scene Graph, which provides structured clarity, enabling precise relation.
Figure 2: Framework for SG-Adapter in Stable Diffusion. The Parser (which could be either an NLP tool [feng2022ssd] or GPT-4 [achiam2023gpt]) extracts linguistic structures from text inputs. Scene graph embeddings are computed as per Eq. \ref{sg-embedding}. The token-triplet matrix, generated by the function $\tau$, guides the refinement of each token and its associated triplet. During testing, when integrated with our SG-Adapter, Stable Diffusion more accurately captures the intended semantic structure in the generated images.
Figure 3: Qualitative Comparisons with Adaptation Methods. In addition to precisely generating each individual relation in the text prompt, our SG-Adapter successfully creates all multiple relations together in correct correspondence.
Figure 4: Comparison with Scene Graph to Image Generation. SG-Adapter outperforms other SG generation methods in terms of image quality and relation accuracy.
Figure 5: Left: The same layout appears visually different due to different relationships. Right: Our method is also capable of learning the customized single relationship.
...and 7 more figures
