SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

Zhiyuan Zhang, DongDong Chen, Jing Liao

TL;DR

A new framework for scene graph-based image editing that integrates a large language model (LLM) with a Text2Image generative model and significantly outperforms existing image editing methods in editing precision and scene aesthetics.

Abstract

Scene graphs offer a structured, hierarchical representation of images, with nodes symbolizing objects and edges symbolizing the relationships among them. They can serve as a natural interface for image editing, dramatically improving precision and flexibility. Leveraging this benefit, we introduce a new framework that integrates a large language model (LLM) with a Text2Image generative model for scene graph-based image editing. This integration enables precise modifications at the object level and creative recomposition of scenes without compromising overall image integrity. Our approach involves two primary stages: 1) Using an LLM-driven scene parser, we construct an image's scene graph, capturing key objects and their interrelationships and parsing fine-grained attributes such as object masks and descriptions. These annotations facilitate concept learning with a fine-tuned diffusion model, representing each object with an optimized token and a detailed description prompt. 2) During the image editing phase, an LLM editing controller guides the edits towards specific areas. These edits are then implemented by an attention-modulated diffusion editor, which uses the fine-tuned model to perform object additions, deletions, replacements, and adjustments. Through extensive experiments, we demonstrate that our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
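
To make the scene graph interface concrete, the following is a minimal sketch of how an image's objects and relationships might be stored, and how a user's change to the graph could be turned into a list of edit operations. It is an illustration only and is not the paper's implementation: the paper uses an LLM editing controller for this translation, whereas the sketch swaps in a plain deterministic graph diff. The names `Edge`, `SceneGraph`, `diff_graphs`, and the operation labels `remove`, `add`, `replace`, and `relate` are all hypothetical, as is the ordering of the operations.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class Edge:
    """A directed relationship between two nodes (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Nodes map an object id to its text description; edges hold relationships."""
    nodes: dict[str, str] = field(default_factory=dict)
    edges: set[Edge] = field(default_factory=set)


def diff_graphs(before: SceneGraph, after: SceneGraph) -> list[tuple]:
    """Turn a user's graph edit into a list of operations.

    Assumption: a simple set difference stands in for the LLM editing
    controller described in the abstract.
    """
    ops: list[tuple] = []
    kept = before.nodes.keys() & after.nodes.keys()

    # Objects that disappeared from the graph are removed from the image.
    for name in sorted(before.nodes.keys() - after.nodes.keys()):
        ops.append(("remove", name))

    # Objects that are new in the graph are added, with their description.
    for name in sorted(after.nodes.keys() - before.nodes.keys()):
        ops.append(("add", name, after.nodes[name]))

    # Objects whose description changed are replaced.
    for name in sorted(kept):
        if before.nodes[name] != after.nodes[name]:
            ops.append(("replace", name, after.nodes[name]))

    # New relationships between objects that existed before and after.
    for edge in sorted(after.edges - before.edges):
        if edge.subject in kept and edge.obj in kept:
            ops.append(("relate", edge.subject, edge.predicate, edge.obj))

    return ops


before = SceneGraph(
    nodes={
        "woman": "a woman in a red coat",
        "dog": "a small brown dog",
        "bench": "a wooden bench",
    },
    edges={Edge("woman", "sitting on", "bench"), Edge("dog", "next to", "bench")},
)
after = SceneGraph(
    nodes={
        "woman": "a woman in a red coat",
        "cat": "a gray cat",
        "bench": "a wooden bench",
    },
    edges={Edge("woman", "standing beside", "bench"), Edge("cat", "on", "bench")},
)

ops = diff_graphs(before, after)
for op in ops:
    print(op)
```

In this sketch each operation carries only names and text. In the framework described above, each object would additionally be tied to a mask and an optimized token so that the diffusion editor can direct the change to a specific region.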

Paper Structure

This paper contains 31 sections, 3 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Our pipeline consists of two main stages: scene parsing and image editing. In the scene parsing stage, the input image is processed by our LLM-driven scene parser, which creates a scene graph and annotations for nodes such as object masks and captions. The node annotations allow for the fine-tuning of the diffusion model, representing each object in the scene with an optimized token and a specific prompt. During the image editing stage, the LLM editing controller translates user manipulations on the scene graph into a sequence of operations with text prompts and directs the targeted edits to specific regions. These edits are implemented by applying attention modulation to the fine-tuned diffusion model, enabling object additions, removals, replacements, and relationship modifications in the scene. The input image is from © iStockphoto.
  • Figure 2: Screenshot of our interface. The input image is from © iStockphoto.
  • Figure 3: Contents represented by the detailed description and optimized token. The leftmost column shows the input images, while the three right columns display images generated using the optimized token, detailed description, and their combination. The combination best preserves the woman's visual identity. The input image is from © Unsplash.
  • Figure 4: Illustration of our Attention Modulated Object Removal and Insertion. The left part shows the attention modulation in self-attention for object removal, and the right part shows the attention modulation in both self- and cross-attention for object insertion.
  • Figure 5: Qualitative comparison with other baseline methods. From left to right: (a) Input images; (b) Scene graphs and user edits; (c) SIMSG [dhamo2020semantic]; (d) SGDiff [yang2022diffusion]; (e) Break-a-scene [avrahami2023break]; (f) InstructPix2Pix [brooks_instructpix2pix_2023]; (g) Ours. Input images: the 1st, 4th, and 6th rows are from © iStockphoto; the 2nd, 3rd, 5th, 7th, and 8th rows are from © Unsplash.
  • ...and 9 more figures