Instruction-based Image Editing with Planning, Reasoning, and Generation

Liya Ji, Chenyang Qi, Qifeng Chen

TL;DR

This work bridges understanding and generation with a new multi-modal model that gives instruction-based image editing models the reasoning abilities needed for more complex cases, and it achieves competitive editing results on complex real-world images.

Abstract

Editing images via instructions provides a natural way to generate interactive content, but it is challenging because it demands both scene understanding and generation. Prior work chains together large language models, object segmentation models, and editing models for this task. However, the understanding models operate on only a single modality, which restricts editing quality. We aim to bridge understanding and generation via a new multi-modal model that gives instruction-based image editing models the intelligence needed for more complex cases. To achieve this goal, we decompose the instruction-based editing task with multi-modal chain-of-thought prompts into three steps: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For CoT planning, a large language model reasons out appropriate sub-prompts given the provided instruction and the abilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints during generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

Paper Structure (21 sections, 5 equations, 7 figures, 3 tables)

This paper contains 21 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: We propose an instruction-based editing method with a Planning, Reasoning, and Generation framework that can edit an image from human language, empowered by a (multi-modal) large language model. Row 1 and Row 2 (right): our model generates more fulfilling content using instructions obtained by chain-of-thought planning. Row 2 (left): our model can further reason about the accurate editing region (shown at the top right of the sub-figures) based on the provided instructions.
  • Figure 2: Our Multi-modal Chain-of-Thought Editing framework executes image editing through three iterative stages: planning, reasoning, and generation. In stage 1, a Chain-of-Thought Planner decomposes the user prompt into a chain of refined editing sub-instructions. For each sub-instruction, an MLLM localizes the target editing region (stage 2) via cross-modal reasoning. The conditional diffusion model then edits the latest image (stage 3) while preserving non-target areas. The system cycles through location reasoning by the MLLM and image generation by the diffusion model until the plan from stage 1 is completed.
  • Figure 3: (a) We trained a Multi-modality LLM that generates an editing region and enables better localization given the input image and sub-prompt. (b) Given the editing region and sub-prompt reasoned by M-LLM, we further train a conditional generative diffusion model to edit the image with better locality.
  • Figure 4: Examples of instruction-based image editing with our method on MagicBrush (Zhang et al., 2023). Editing regions reasoned by the M-LLM are shown in the bottom left corner of our editing results. Under the examples, we show the Chain-of-Thought (CoT) planning, which helps to interpret some concepts or break down the tasks.
  • Figure 5: Examples of our Multimodal Chain-of-Thought Editing Framework on HQ-Abstract. The abstract topic is "dramatic". Our CoT planning with multimodal LLMs can instantiate the abstract instruction into more specific details. The editing area is shown in the bottom left of each image.
  • ...and 2 more figures
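
The control flow described in the Figure 2 caption (plan once, then alternate region reasoning and editing for each sub-instruction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`run_editing_loop`, `toy_planner`, `toy_region_reasoner`, `toy_editor`), the toy image and mask types, and the explicit mask compositing step are all assumptions made for illustration. In the paper the three roles are filled by an LLM planner, a trained multi-modal LLM, and a hint-guided diffusion model.

```python
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image (assumption, stands in for a real image)
Mask = List[List[bool]]   # True marks pixels the editor is allowed to change


def run_editing_loop(
    image: Image,
    instruction: str,
    planner: Callable[[str], List[str]],
    region_reasoner: Callable[[Image, str], Mask],
    editor: Callable[[Image, str, Mask], Image],
) -> Image:
    """Plan once, then reason about a region and edit for each sub-instruction."""
    sub_instructions = planner(instruction)        # stage 1: planning
    current = image
    for sub in sub_instructions:
        mask = region_reasoner(current, sub)       # stage 2: editing region reasoning
        proposal = editor(current, sub, mask)      # stage 3: generation on the latest image
        # Illustrative compositing: keep pixels outside the mask unchanged,
        # mimicking "preserving non-target areas".
        current = [
            [new if m else old for old, new, m in zip(old_row, new_row, mask_row)]
            for old_row, new_row, mask_row in zip(current, proposal, mask)
        ]
    return current


# Hypothetical stand-ins for the three models, used only to exercise the loop.
def toy_planner(instruction: str) -> List[str]:
    # Splits the instruction on " then " instead of calling a language model.
    return [part.strip() for part in instruction.split(" then ") if part.strip()]


def toy_region_reasoner(image: Image, sub: str) -> Mask:
    # Selects the left half if the sub-instruction mentions "left", else the right half.
    width = len(image[0])
    wants_left = "left" in sub
    return [[(x < width // 2) == wants_left for x in range(width)] for _ in image]


def toy_editor(image: Image, sub: str, mask: Mask) -> Image:
    # Brightens every pixel by 1; the loop restricts the change to the mask.
    return [[pixel + 1 for pixel in row] for row in image]
```

Calling `run_editing_loop([[0, 0, 0, 0]], "brighten left then brighten right", toy_planner, toy_region_reasoner, toy_editor)` applies two sub-edits in sequence, each confined to its own region. Passing the three stages as callables keeps the loop independent of any particular model.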