Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation

Yunnan Wang, Ziqiang Li, Zequn Zhang, Wenyao Zhang, Baao Xie, Xihui Liu, Wenjun Zeng, Xin Jin

TL;DR

This work proposes a Semantics-Layout Variational AutoEncoder (SL-VAE) and develops a Compositional Masked Attention (CMA) integrated with a diffusion model, incorporating (layouts, semantics) with fine-grained attributes as generation guidance.

Abstract

There has been exciting progress in generating images from natural language or layout conditions. However, these methods struggle to faithfully reproduce complex scenes because they insufficiently model multiple objects and their relationships. To address this issue, we leverage the scene graph, a powerful structured representation, for complex image generation. Unlike previous works that directly use scene graphs for generation, we employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner, composing diverse disentangled visual cues from scene graphs. Specifically, we first propose a Semantics-Layout Variational AutoEncoder (SL-VAE) to jointly derive (layouts, semantics) from the input scene graph, which allows more diverse and reasonable generation through a one-to-many mapping. We then develop a Compositional Masked Attention (CMA) integrated with a diffusion model, incorporating (layouts, semantics) with fine-grained attributes as generation guidance. To further enable graph manipulation while keeping the visual content consistent, we introduce a Multi-Layered Sampler (MLS) for an "isolated" image editing effect. Extensive experiments demonstrate that our method outperforms recent competitors based on text, layout, or scene graph in terms of generation rationality and controllability.
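
As a rough illustration of the one-to-many mapping mentioned in the abstract, the sketch below draws several latent samples from a single Gaussian using the standard VAE reparameterization trick. It is not the SL-VAE itself: the function name `sample_latent`, the latent size, and the example `mu` and `log_var` values are hypothetical, and the encoder and decoders are omitted entirely.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps, eps ~ N(0, 1).

    mu and log_var are per-dimension lists that, in a real model, would be
    predicted by an encoder from the scene graph (assumed here, not shown).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


# Hypothetical encoder output for one scene-graph node (4 latent dimensions).
mu = [0.2, -0.1, 0.5, 0.0]
log_var = [-1.0, -1.0, -2.0, -0.5]

rng = random.Random(0)  # seeded so the sketch is reproducible
z_first = sample_latent(mu, log_var, rng)
z_second = sample_latent(mu, log_var, rng)

# The same graph-derived distribution yields different latents, so a decoder
# fed with them could produce different layouts/semantics for one input graph.
print(z_first)
print(z_second)
```

Each call produces a different latent vector from the same distribution parameters, which is the sense in which one input can map to many plausible outputs.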

Paper Structure (11 sections, 11 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Failure cases generated by (a) text-to-image (T2I) (DALL·E 3 betker2023improving), (b) layout-to-image (L2I) (LayoutDiffusion zheng2023layoutdiffusion), and (c) semantics-based scene-graph-to-image (SG2I) (R3CD liu2024r3cd) methods. (d) Generalizable object Attribute Control (AC) with consistent visual content, achieved by our DisCo.
  • Figure 2: Comparison between previous SG2I architectures and ours. (a) Layout-based SG2I models farshad2023scenegenie generate a spatial arrangement as an object layout; (b) Semantic-based SG2I models liu2024r3cd, yang2022diffusion build interactive semantic embeddings between objects; (c) Our method leverages the scene graph representation by jointly deriving the disentangled layout and semantics with the proposed SL-VAE.
  • Figure 3: Framework overview. (I) We parameterize the node embeddings as a Gaussian distribution with the Graph Union Encoder, which jointly models the spatial relationships and non-spatial interactions in scene graphs; (II) The Layout and Semantic Decoders generate spatial layouts and interactive semantics, respectively, from samples drawn from the Gaussian distribution; (III) A diffusion model with the proposed Compositional Masked Attention (CMA) incorporates object-level conditions to generate images that follow the scene graph description; (IV) Detailed structure of the CMA layer.
  • Figure 4: Toy example of (a) compositional masked attention, and (b) its corresponding attention mask. We use visual tokens and object embeddings of objects A and B for demonstration. A and B have 1 and 2 visual tokens, respectively, whose attribution is determined by bounding boxes.
  • Figure 5: Illustration of object-level Node Addition (NA) and Attribute Control (AC) in the scene. From left to right: (a) the image generated by the unmodified scene graph; (b) the chair addition; (c) the blue-colored wall; and (d) the red-colored wall.
  • ...and 1 more figures
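
The toy example in the Figure 4 caption (object A owning 1 visual token and object B owning 2, with ownership decided by bounding boxes) can be sketched as a small boolean mask. This is a minimal sketch under stated assumptions, not the paper's CMA implementation: the helper names `token_centers` and `build_token_object_mask`, the 1x3 token grid, and the rule "a token belongs to an object if the token's center falls inside the object's box" are all hypothetical. Only the visual-token-to-object-embedding block of the mask is built, since the caption does not say how other token pairs are masked.

```python
from typing import List, Tuple

# Box as (x0, y0, x1, y1) in normalized [0, 1] image coordinates (assumed format).
Box = Tuple[float, float, float, float]


def token_centers(grid_h: int, grid_w: int) -> List[Tuple[float, float]]:
    """Centers (x, y) of visual tokens laid out on a grid, in row-major order."""
    return [
        ((col + 0.5) / grid_w, (row + 0.5) / grid_h)
        for row in range(grid_h)
        for col in range(grid_w)
    ]


def build_token_object_mask(boxes: List[Box], grid_h: int, grid_w: int) -> List[List[bool]]:
    """mask[i][k] is True when visual token i may attend to object embedding k.

    Assumption: a token is attributed to an object when its center lies
    inside that object's bounding box (half-open on the right/bottom edge).
    """
    mask = []
    for cx, cy in token_centers(grid_h, grid_w):
        row = [x0 <= cx < x1 and y0 <= cy < y1 for (x0, y0, x1, y1) in boxes]
        mask.append(row)
    return mask


# Toy setup mirroring the caption: three visual tokens in a single row,
# object A covers the left third and object B covers the remaining two thirds.
box_a: Box = (0.0, 0.0, 1.0 / 3.0, 1.0)
box_b: Box = (1.0 / 3.0, 0.0, 1.0, 1.0)
mask = build_token_object_mask([box_a, box_b], grid_h=1, grid_w=3)

for token_index, row in enumerate(mask):
    print(token_index, row)
```

In an attention layer, entries that are False would typically be set to a large negative value before the softmax so that a visual token receives no contribution from object embeddings it does not belong to.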