InsightEdit: Towards Better Instruction Following for Image Editing

Yingjing Xu, Jie Kong, Jiazhi Wang, Xiao Pan, Bo Lin, Qiang Liu

TL;DR

This work curates the AdvancedEdit dataset with a novel data construction pipeline, yielding a large-scale dataset with high visual quality, complex instructions, and good background consistency. It also introduces a two-stream bridging mechanism that uses both the textual and visual features reasoned by a Multimodal Large Language Model (MLLM).

Abstract

In this paper, we focus on the task of instruction-based image editing. Previous works like InstructPix2Pix, InstructDiffusion, and SmartEdit have explored end-to-end editing. However, two limitations remain. First, existing datasets suffer from low resolution, poor background consistency, and overly simplistic instructions. Second, current approaches mainly condition on the text while leaving the rich image information underexplored, and are therefore inferior at following complex instructions and maintaining background consistency. To address these issues, we first curate the AdvancedEdit dataset using a novel data construction pipeline, yielding a large-scale dataset with high visual quality, complex instructions, and good background consistency. Then, to further inject the rich image information, we introduce a two-stream bridging mechanism that uses both the textual and visual features reasoned by powerful Multimodal Large Language Models (MLLMs) to guide the image editing process more precisely. Extensive results demonstrate that our approach, InsightEdit, achieves state-of-the-art performance, excelling in complex instruction following and maintaining high background consistency with the original image.

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: We propose InsightEdit, an end-to-end instruction-based image editing model, trained on high-quality data and designed to fully harness the capabilities of Multimodal Large Language Models (MLLM), achieving high-quality edits with strong instruction-following and background consistency.
  • Figure 2: The overall data construction pipeline. (1) Captioning & Object Extraction: a VLM generates a global caption from the source image and then produces a JSON list of objects, each with a simple caption and a detailed caption. (2) Mask Generation: GroundedSAM produces the corresponding mask for each object. (3) Editing Pair Construction: a mask-based image editing model constructs the target image and a templated instruction. (4) Instruction Recaptioning: a VLM rewrites the instruction to obtain diverse instructions. (5) Quality Evaluation: the dataset is filtered using VIEScore.
  • Figure 3: The overall architecture of InsightEdit. It consists of three parts: (1) Comprehension Module: leverages an MLLM to perceive and comprehend the image editing task; (2) Bridging Module: interacts with and extracts both the textual and image features; (3) Generation Module: takes the editing guidance and generates the target image with a diffusion model.
  • Figure 4: Qualitative comparison on AdvancedEdit. InsightEdit shows superior instruction following and background consistency capability.
  • Figure 5: Demonstration of the effectiveness of the IAA module.
  • ...and 1 more figure
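
The five stages named in the Figure 2 caption can be read as a simple filter-and-transform loop. The Python sketch below is only an illustration of that control flow, not the authors' implementation: every name in it (`PipelineStages`, `build_samples`, `toy_stages`, the `min_score` threshold, and the score scale) is hypothetical, and the real VLM, GroundedSAM, editing model, and VIEScore calls are replaced by injected callables with toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class ObjectEntry:
    simple_caption: str    # short object name, e.g. "dog"
    detailed_caption: str  # longer description of the same object


@dataclass
class EditSample:
    source_image: str
    target_image: str
    instruction: str
    score: float


@dataclass
class PipelineStages:
    """Hypothetical hooks, one per stage in the Figure 2 caption."""
    caption_objects: Callable[[str], List[ObjectEntry]]             # (1) VLM
    generate_mask: Callable[[str, ObjectEntry], Optional[str]]      # (2) segmentation
    edit_with_mask: Callable[[str, str, ObjectEntry], Tuple[str, str]]  # (3) editor
    rewrite_instruction: Callable[[str], str]                       # (4) VLM
    score_edit: Callable[[str, str, str], float]                    # (5) quality metric


def build_samples(images: Iterable[str], stages: PipelineStages,
                  min_score: float) -> List[EditSample]:
    """Run the five stages per object and keep pairs scoring >= min_score."""
    kept: List[EditSample] = []
    for image in images:
        for obj in stages.caption_objects(image):
            mask = stages.generate_mask(image, obj)
            if mask is None:  # no usable mask: skip this object
                continue
            target, templated = stages.edit_with_mask(image, mask, obj)
            instruction = stages.rewrite_instruction(templated)
            score = stages.score_edit(image, target, instruction)
            if score >= min_score:  # threshold value is an assumption
                kept.append(EditSample(image, target, instruction, score))
    return kept


def toy_stages() -> PipelineStages:
    """Toy stand-ins so the sketch runs without any model."""
    return PipelineStages(
        caption_objects=lambda image: [
            ObjectEntry("dog", "a brown dog on grass"),
            ObjectEntry("sky", "a cloudy sky"),
        ],
        generate_mask=lambda image, obj: (
            None if obj.simple_caption == "sky"
            else f"{image}:mask:{obj.simple_caption}"
        ),
        edit_with_mask=lambda image, mask, obj: (
            f"{image}:edited", f"remove the {obj.simple_caption}"
        ),
        rewrite_instruction=lambda text: f"Please {text} from the photo.",
        score_edit=lambda src, tgt, instr: 8.0 if src == "a.png" else 3.0,
    )


if __name__ == "__main__":
    for sample in build_samples(["a.png", "b.png"], toy_stages(), min_score=5.0):
        print(sample)
```

Passing the stages in as callables keeps the loop independent of any particular model, so each stand-in could in principle be swapped for a real component without changing the filtering logic.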