SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Guibao Shen, Luozhou Wang, Jiantao Lin, Wenhang Ge, Chaozhe Zhang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Guangyong Chen, Yijun Li, Ying-Cong Chen

TL;DR

This work tackles the problem of false contextualization (relation leakage) in text-to-image diffusion caused by sequential text encoders. It introduces the Scene Graph Adapter (SG-Adapter), a transformer-based refinement placed after the CLIP encoder that uses scene graph triplets and a triplet-token attention mask $M^{\text{sg}}$ to align word embeddings with the correct subject–relation–object structures. A clean, multi-relational MultiRels dataset is proposed, along with three GPT-4V–derived metrics (SG-IoU, Entity-IoU, Relation-IoU) to measure image–scene-graph correspondence. Experimental results show SG-Adapter improves relation generation and correspondence while maintaining image quality, outperforming SG-to-image and baseline text-to-image methods. The approach enables more accurate control of complex relationships in diffusion-based generation and highlights the importance of high-quality, relation-rich data for multi-relational scene understanding.
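
The triplet-token mask idea can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes each token is matched to a triplet by plain word identity, and the prompt, the triplets, and the helper name `token_triplet_mask` are all hypothetical. A real pipeline would work on CLIP token indices and would typically turn the boolean matrix into an additive attention mask.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def token_triplet_mask(tokens: List[str], triplets: List[Triplet]) -> List[List[bool]]:
    """Return mask[i][j] = True when token i is a word of triplet j.

    Assumption: matching is by exact word identity, which is only
    adequate for a toy prompt with no repeated content words.
    """
    # Pre-compute the set of words covered by each triplet.
    triplet_words = [set(" ".join(t).split()) for t in triplets]
    return [[tok in words for words in triplet_words] for tok in tokens]


# Hypothetical prompt and scene graph, made up for illustration.
tokens = "a man playing guitar and a woman holding an umbrella".split()
triplets = [("man", "playing", "guitar"), ("woman", "holding", "umbrella")]
mask = token_triplet_mask(tokens, triplets)

for tok, row in zip(tokens, mask):
    print(f"{tok:>10} {row}")
```

In this sketch "guitar" is visible only to the first triplet and "woman" only to the second, which is the kind of restriction that keeps a relation such as "playing guitar" from leaking onto the wrong entity. Function words such as "and" belong to no triplet and get an all-False row.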

Abstract

Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), which leverages the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit, non-fully-connected graph representation greatly improves on the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset, MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.
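
To make the three correspondence metrics concrete, here is a hedged sketch of how IoU-style scores over scene graphs could be computed. The exact definitions used in the paper are not given in this summary, so this assumes each score is a plain set intersection-over-union: over whole triplets (SG-IoU), over entities (Entity-IoU), and over relation words (Relation-IoU). The `predicted` graph stands in for whatever GPT-4V would extract from a generated image; all names and data below are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def iou(a: Iterable, b: Iterable) -> float:
    """Set intersection-over-union; two empty sets count as a perfect match."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def scene_graph_ious(reference: List[Triplet], predicted: List[Triplet]) -> Dict[str, float]:
    """Assumed definitions: IoU over triplets, entities, and relations."""
    def entities(graph):
        return {x for s, _, o in graph for x in (s, o)}

    def relations(graph):
        return {r for _, r, _ in graph}

    return {
        "sg_iou": iou(reference, predicted),
        "entity_iou": iou(entities(reference), entities(predicted)),
        "relation_iou": iou(relations(reference), relations(predicted)),
    }


# Hypothetical case: every entity and relation is present, but the
# relations are attached to the wrong subjects (relation leakage).
reference = [("man", "playing", "guitar"), ("woman", "holding", "umbrella")]
predicted = [("man", "holding", "umbrella"), ("woman", "playing", "guitar")]
scores = scene_graph_ious(reference, predicted)
print(scores)  # {'sg_iou': 0.0, 'entity_iou': 1.0, 'relation_iou': 1.0}
```

Under these assumed definitions, the entity and relation scores are perfect while the triplet-level score is zero, which shows why a triplet-level measure is needed to detect mismatched correspondence.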

Paper Structure (19 sections, 8 equations, 12 figures, 3 tables)


Figures (12)

  • Figure 1: Overcoming Contextualization Limits in Image Generation with Scene Graph. The left section highlights the limitations of text embeddings in sequential text processing, showcasing how relations like "playing guitar" may erroneously apply to the "woman". The right section illustrates the improvement from using a Scene Graph, which provides structured clarity, enabling precise relation assignment.
  • Figure 2: Framework for SG-Adapter in Stable Diffusion. The Parser (which could be either an NLP tool [feng2022ssd] or GPT-4 [achiam2023gpt]) extracts linguistic structures from text inputs. Scene graph embeddings are computed as per Eq. (sg-embedding). The token-triplet matrix, generated by the function $\tau$, guides the refinement of each token and its associated triplet. During testing, when integrated with our SG-Adapter, Stable Diffusion more accurately captures the intended semantic structure in the generated images.
  • Figure 3: Qualitative Comparisons with Adaptation Methods. In addition to precisely generating each individual relation in the text prompt, our SG-Adapter successfully creates all multiple relations together in correct correspondence.
  • Figure 4: Comparison with Scene Graph to Image Generation. SG-Adapter outperforms other SG generation methods in terms of image quality and relation accuracy.
  • Figure 5: Left: The same layout appears visually different due to different relationships. Right: Our method is also capable of learning a customized single relationship.
  • ...and 7 more figures