InstructG2I: Synthesizing Images from Multimodal Attributed Graphs

Bowen Jin, Ziqi Pang, Bingjun Guo, Yu-Xiong Wang, Jiaxuan You, Jiawei Han

TL;DR

This paper proposes InstructG2I, a graph context-conditioned diffusion model. It exploits graph structure and multimodal information to sample informative neighbors, combining personalized PageRank with re-ranking based on vision-language features, and it introduces graph classifier-free guidance for controllable generation.

Abstract

In this paper, we approach an overlooked yet critical task, Graph2Image: generating images from multimodal attributed graphs (MMAGs). This task poses significant challenges due to the explosion in graph size, dependencies among graph entities, and the need for controllability in graph conditions. To address these challenges, we propose a graph context-conditioned diffusion model called InstructG2I. InstructG2I first exploits the graph structure and multimodal information to conduct informative neighbor sampling by combining personalized PageRank with re-ranking based on vision-language features. Then, a Graph-QFormer encoder adaptively encodes the graph nodes into an auxiliary set of graph prompts to guide the denoising process of diffusion. Finally, we propose graph classifier-free guidance, enabling controllable generation by varying the strength of graph guidance and of the multiple edges connected to a node. Extensive experiments conducted on three datasets from different domains demonstrate the effectiveness and controllability of our approach. The code is available at https://github.com/PeterGriffinJin/InstructG2I.
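The two-stage neighbor sampling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the restart probability `alpha`, the candidate count, and the toy feature vectors are all assumptions made for illustration, and plain cosine similarity stands in for whatever vision-language similarity the paper uses. The idea shown is only the ordering of the steps: rank nodes by personalized PageRank from the target node, keep the top candidates, then re-rank those candidates by feature similarity to the query.

```python
import math


def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power-iteration PPR on a graph given as {node: [neighbors]}.

    alpha is the restart probability (an assumed value, not from the paper).
    """
    scores = {n: 0.0 for n in adj}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, neighbors in adj.items():
            if not neighbors:
                # Dangling node: send its walking mass back to the source.
                nxt[source] += (1 - alpha) * scores[n]
                continue
            share = (1 - alpha) * scores[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        # Restart step: total mass is 1, so alpha of it returns to the source.
        nxt[source] += alpha
        scores = nxt
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sample_neighbors(adj, features, target, query_vec, n_candidates=3, k=2):
    """Stage 1: structural ranking by PPR. Stage 2: semantic re-ranking."""
    ppr = personalized_pagerank(adj, target)
    candidates = sorted(
        (n for n in adj if n != target), key=lambda n: ppr[n], reverse=True
    )[:n_candidates]
    reranked = sorted(
        candidates, key=lambda n: cosine(features[n], query_vec), reverse=True
    )
    return reranked[:k]


# Toy graph: "d" and "e" are disconnected from the target "t".
toy_adj = {
    "t": ["a", "b"], "a": ["t", "c"], "b": ["t"],
    "c": ["a"], "d": ["e"], "e": ["d"],
}
# Hypothetical 2-d stand-ins for vision-language features.
toy_features = {
    "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1],
    "d": [1.0, 0.0], "e": [1.0, 0.0],
}
toy_result = sample_neighbors(toy_adj, toy_features, "t", [1.0, 0.0])
print(toy_result)
```

In this toy graph, node "d" matches the query vector exactly but is unreachable from the target, so the structural stage filters it out before the semantic stage runs. That is the motivation for combining the two signals rather than using similarity alone.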

Paper Structure

This paper contains 25 sections, 21 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We propose a new task Graph2Image featuring image synthesis by conditioning on graph information and introduce a novel graph-conditioned diffusion model called InstructG2I to tackle this problem. (a) Graph2Image is supported by prevalent multimodal attributed graphs and is grounded in real-world applications, e.g., virtual artistry. (b) InstructG2I outperforms baseline image generation techniques, demonstrating the usefulness of graph information. (c) To accommodate realistic user queries, InstructG2I exhibits smooth controllability in utilizing text/graph information and managing the strength of multiple graph edges.
  • Figure 2: The overall framework of InstructG2I. (a) Given a target node with a text prompt (e.g., House in Snow) in a Multimodal Attributed Graph (MMAG) for which we want to generate an image, (b) we first perform semantic PPR-based neighbor sampling, which involves structure-aware personalized PageRank and semantic-aware similarity-based reranking to sample informative neighboring nodes in the graph. (c) These neighboring nodes are then inputted into a Graph-QFormer, encoded by multiple self-attention and cross-attention layers, represented as graph tokens and used to guide the denoising process of the diffusion model, together with text prompt tokens.
  • Figure 3: InstructG2I achieves the best trade-off between DINOv2 ($\uparrow$) and FID ($\downarrow$) scores.
  • Figure 4: Qualitative evaluation. Our method exhibits better consistency with the ground truth by better utilizing the graph information from neighboring nodes ("Sampled Neighbors" in the figure).
  • Figure 5: Ablation study on semantic PPR-based neighbor sampling. The results indicate that both structural and semantic relevance proposed by our method effectively improve the image generation quality and consistency with the graph context.
  • ...and 2 more figures
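The controllability mentioned in Figure 1(c) rests on classifier-free guidance extended to graph conditions. The paper's exact formulation is not reproduced in this overview, so the sketch below is an assumption: it uses the common two-condition decomposition (in the style of InstructPix2Pix) with one strength for text and one strength per graph condition. The function name `guided_noise` and all argument names are hypothetical, and short Python lists stand in for the noise tensors a denoising network would predict.

```python
def guided_noise(eps_uncond, eps_text, eps_graph_list, s_text, s_graph_list):
    """Combine noise predictions under text and graph guidance.

    eps_uncond:     prediction with no conditioning
    eps_text:       prediction with the text prompt only
    eps_graph_list: predictions with text plus one graph condition each
    s_text:         text guidance strength
    s_graph_list:   one guidance strength per graph condition

    Assumed form (not confirmed by the source):
        eps = eps_uncond
              + s_text * (eps_text - eps_uncond)
              + sum_k s_k * (eps_graph_k - eps_text)
    """
    combined = []
    for i, u in enumerate(eps_uncond):
        t = eps_text[i]
        value = u + s_text * (t - u)
        for eps_g, s_g in zip(eps_graph_list, s_graph_list):
            # Each graph condition pushes the estimate away from text-only.
            value += s_g * (eps_g[i] - t)
        combined.append(value)
    return combined


# Toy predictions (made-up numbers, two "pixels" each).
eps_u = [0.0, 1.0]
eps_t = [1.0, 3.0]
eps_g1 = [2.0, 2.0]
eps_g2 = [0.0, 4.0]

# Raising the second graph strength shifts the result toward eps_g2.
print(guided_noise(eps_u, eps_t, [eps_g1, eps_g2], 2.0, [1.0, 0.0]))
print(guided_noise(eps_u, eps_t, [eps_g1, eps_g2], 2.0, [1.0, 0.5]))
```

Setting every graph strength to zero recovers ordinary text-only classifier-free guidance, and varying the per-condition strengths is one way to weight several graph edges against each other under this assumed form.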

Theorems & Definitions (2)

  • Definition 1
  • Definition 2