Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs

Rameshwar Mishra, A V Subramanyam

TL;DR

This work addresses the challenge of generating images from scene graphs without predicting intermediate layouts by leveraging CLIP-guided diffusion with a graph-conditioned signal. A graph encoder produces CLIP-aligned scene-graph embeddings that, when fused with per-object CLIP label embeddings, form $S_{cond}$ to fine-tune a pre-trained diffusion model; training optimizes a reconstruction loss together with an alignment loss that aligns graph features to CLIP space via $\mathcal{L}_{train} = \lambda \mathcal{L}_{recon} + (1-\lambda) \mathcal{L}_{align}$, where $\mathcal{L}_{align} = \beta \mathcal{L}_{CLIP} + (1-\beta) \mathcal{L}_{MMD}$. A GAN-based CLIP alignment module ensures $G_{global}^s$ matches CLIP features, enabling effective conditioning for diffusion-based image synthesis. Empirically, the method achieves state-of-the-art results on Visual Genome and COCO-stuff, improving FID and IS while enhancing diversity (DS) and object-coverage (OOR). Overall, this approach demonstrates that graph-aware CLIP guidance can enable accurate, diverse, and graph-consistent image generation from complex scene graphs.
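The nested weighting in the training objective above can be made concrete with a small sketch. It only shows how $\lambda$ and $\beta$ combine three already-computed scalar loss values; the function names and the numeric values are hypothetical, and the way each individual loss is computed is not specified here.

```python
def alignment_loss(clip_loss: float, mmd_loss: float, beta: float) -> float:
    """L_align = beta * L_CLIP + (1 - beta) * L_MMD."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * clip_loss + (1.0 - beta) * mmd_loss


def training_loss(recon_loss: float, clip_loss: float, mmd_loss: float,
                  lam: float, beta: float) -> float:
    """L_train = lam * L_recon + (1 - lam) * L_align."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * recon_loss + (1.0 - lam) * alignment_loss(clip_loss, mmd_loss, beta)


# Illustrative values only (not taken from the paper).
total = training_loss(recon_loss=1.0, clip_loss=2.0, mmd_loss=4.0, lam=0.5, beta=0.25)
print(total)
```

Because both combinations are convex, $\lambda = 1$ reduces the objective to pure reconstruction, and $\beta = 1$ drops the MMD term from the alignment loss.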

Abstract

Advancements in generative models have sparked significant interest in generating images while adhering to specific structural guidelines. Scene graph to image generation is one such task: generating images that are consistent with a given scene graph. However, the complexity of visual scenes makes it challenging to accurately align objects according to the relations specified in the scene graph. Existing methods approach this task by first predicting a scene layout and then generating images from these layouts using adversarial training. In this work, we introduce a novel approach to generating images from scene graphs that eliminates the need to predict intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. To this end, we first pre-train our graph encoder to align graph features with the CLIP features of the corresponding images using GAN-based training. Further, we fuse the graph features with the CLIP embeddings of the object labels present in the given scene graph to create a graph-consistent, CLIP-guided conditioning signal. In the conditioning input, object embeddings provide the coarse structure of the image, and graph features provide structural alignment based on the relationships among objects. Finally, we fine-tune a pre-trained diffusion model on the graph-consistent conditioning signal using reconstruction and CLIP alignment losses. Extensive experiments show that our method outperforms existing methods on the standard COCO-stuff and Visual Genome benchmarks.
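As a rough illustration of the conditioning signal described above, the sketch below assembles one graph embedding and several per-object label embeddings into a single sequence. The abstract does not state which fusion operator is used, so stacking the vectors along a token axis is purely an assumption made here, and `build_conditioning` is a hypothetical helper name.

```python
from typing import List

Vector = List[float]


def build_conditioning(graph_embedding: Vector,
                       object_embeddings: List[Vector]) -> List[Vector]:
    """Stack the graph embedding and the object-label embeddings into one sequence.

    Assumption: fusion is modeled as simple stacking along the token axis,
    with the graph embedding first. The actual operator may differ.
    """
    dim = len(graph_embedding)
    for emb in object_embeddings:
        if len(emb) != dim:
            raise ValueError("all embeddings must share the same dimension")
    # Copy the vectors so the caller's lists are not aliased.
    return [list(graph_embedding)] + [list(emb) for emb in object_embeddings]


# Toy 2-dimensional vectors stand in for real CLIP embeddings.
cond = build_conditioning([0.1, 0.2], [[1.0, 0.0], [0.0, 1.0]])
print(len(cond), len(cond[0]))
```

Keeping the object embeddings as separate tokens would let a cross-attention layer attend to each object individually, which matches the stated role of object embeddings as coarse structure; this is a design guess rather than a confirmed detail.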

Paper Structure (18 sections, 16 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Samples generated using our architecture. The generated images reflect the structure given in the scene graph, and they differ from the ground-truth images, illustrating the diversity of the generated samples.
  • Figure 2: Overview of the proposed architecture. The graph encoder produces a CLIP-aligned graph embedding, which is fused with the semantic label embeddings of the objects present in the scene graph. The fused embedding forms the conditioning signal for the diffusion model. During training, we pass this conditioning signal together with the noised input image and guide training using reconstruction and CLIP alignment losses. During sampling, we pass this conditioning signal together with noise to generate the image corresponding to the input scene graph.
  • Figure 3: The graph embedding is aligned with the CLIP visual features of the corresponding image. Alignment is achieved using a GAN-based architecture.
  • Figure 4: Qualitative comparison of $256 \times 256$ images generated by various publicly available scene-graph-to-image models. All input graphs, which correspond to ground-truth images, are perturbed slightly to check the effectiveness of each method. The last column shows images generated by our method.