Instruction-based Image Editing with Planning, Reasoning, and Generation

Liya Ji; Chenyang Qi; Qifeng Chen

TL;DR

This work aims to bridge understanding and generation with a new multi-modal model that gives instruction-based image editing models the reasoning ability needed for more complex cases, while remaining competitive at editing complex real-world images.

Abstract

Editing images via instructions offers a natural way to generate interactive content, but it is challenging because it places high demands on both scene understanding and generation. Prior work chains together large language models, object segmentation models, and editing models for this task. However, the understanding models in such chains operate on a single modality, which restricts editing quality. We aim to bridge understanding and generation with a new multi-modal model that gives instruction-based image editing models the intelligence needed for more complex cases. To this end, we decompose the instruction-based editing task with multi-modal chain-of-thought prompts into three steps: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For CoT planning, a large language model reasons out appropriate sub-prompts, taking into account the given instruction and the capabilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints when generating the edited image. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.
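
The three steps named in the abstract (planning, editing region reasoning, editing) can be read as a simple control loop. The sketch below is only an illustration of that control flow under stated assumptions: every name in it (`edit_with_plan`, `toy_plan`, `toy_locate`, `toy_edit`) is hypothetical, the image is a toy nested list, and the three stubs stand in for the LLM planner, the multi-modal LLM region reasoner, and the diffusion-based editor. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not the paper's API.
Image = List[List[int]]    # toy grayscale image: rows of pixel values
Mask = List[List[bool]]    # True marks pixels inside the editing region

Planner = Callable[[str], List[str]]           # instruction -> sub-instructions
RegionReasoner = Callable[[Image, str], Mask]  # (image, sub-instruction) -> region
Editor = Callable[[Image, str, Mask], Image]   # (image, sub-instruction, region) -> image


def edit_with_plan(
    image: Image,
    instruction: str,
    plan: Planner,
    locate: RegionReasoner,
    edit: Editor,
) -> Tuple[Image, List[Tuple[str, Mask]]]:
    """Plan once, then alternate region reasoning and editing per sub-instruction."""
    current = image
    trace: List[Tuple[str, Mask]] = []
    for sub in plan(instruction):
        region = locate(current, sub)         # reason about where to edit
        current = edit(current, sub, region)  # edit the latest image
        trace.append((sub, region))
    return current, trace


# Toy stand-ins (assumptions for illustration only).
def toy_plan(instruction: str) -> List[str]:
    # Stand-in for CoT planning: split a compound instruction on " and ".
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def toy_locate(image: Image, sub: str) -> Mask:
    # Stand-in for region reasoning: pick the left or right half of the image.
    width = len(image[0])
    want_left = "left" in sub
    return [[(x < width // 2) == want_left for x in range(width)] for _ in image]


def toy_edit(image: Image, sub: str, region: Mask) -> Image:
    # Stand-in for the editor: change pixels inside the region, keep the rest.
    return [
        [p + 1 if inside else p for p, inside in zip(row, mask_row)]
        for row, mask_row in zip(image, region)
    ]


blank = [[0, 0, 0, 0], [0, 0, 0, 0]]
result, trace = edit_with_plan(
    blank,
    "brighten the left side and brighten the right side",
    toy_plan,
    toy_locate,
    toy_edit,
)
```

Passing the three stages in as callables keeps the loop independent of any particular model; in this toy run the planner yields two sub-instructions, and each edit touches only the region returned for it.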

Paper Structure (21 sections, 5 equations, 7 figures, 3 tables)

Introduction
Related Work
Instruction-based Image Editing
Multi-Modality LLMs for Vision Tasks
Controllable Generation in Diffusion Models
Method
Image Editing with Multi-modal Chain-of-Thought Prompts
Editing Region Reasoning
Hint-guided Editing Network
Classifier-free Guidance for Three Conditions
Experiments
Datasets and Pretrained Models
Baselines and Metrics
Implementation details
Experimental Results
...and 6 more sections

Figures (7)

Figure 1: We propose an instruction-based editing method with a Planning, Reasoning, and Generation framework that can edit images from human language, empowered by a (multi-modal) large language model. Row 1 and Row 2 right: our model can generate more fulfilling content using instructions obtained by chain-of-thought. Row 2 left: our model can further reason about the accurate editing region (shown at the top right of the sub-figures) based on the provided instructions.
Figure 2: Our Multi-modal Chain-of-Thought Editing framework executes image editing through three iterative stages: planning, reasoning, and generation. In stage 1, a Chain-of-Thought Planner decomposes the user prompt into a chain of refined editing sub-instructions. For each sub-instruction, an MLLM localizes the target editing region (stage 2) via cross-modal reasoning; the conditional diffusion model then edits the latest image (stage 3) while preserving non-target areas. The system cycles through location reasoning by the MLLM and image generation by the diffusion model until the plan from stage 1 is completed.
Figure 3: (a) We train a multi-modal LLM (M-LLM) that generates an editing region, enabling better localization given the input image and sub-prompt. (b) Given the editing region and sub-prompt reasoned by the M-LLM, we further train a conditional generative diffusion model to edit the image with better locality.
Figure 4: Examples of our method for instruction-based image editing on MagicBrush [Zhang2023MagicBrush]. Editing regions reasoned by the M-LLM are shown in the bottom left corner of our editing results. Under the examples, we show the Chain-of-Thought (CoT) planning, which helps to interpret certain concepts or break down the tasks.
Figure 5: Examples of our Multi-modal Chain-of-Thought Editing framework on HQ-Abstract. The abstract topic is "dramatic". Our CoT planning with multi-modal LLMs can instantiate the abstract instruction into more specific details. The editing area is shown in the bottom left of each image.
...and 2 more figures
