SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

TL;DR

This work tackles complex instruction-based image editing by integrating Multimodal Large Language Models with diffusion-based editing through a Bidirectional Interaction Module, enabling richer reasoning and image-text interaction. A two-stage training regime, combined with a data strategy that includes segmentation data and a synthetic complex-edit dataset, enhances perception and reasoning capabilities, while Reason-Edit provides a targeted benchmark for evaluating such complex edits. Empirical results show that SmartEdit surpasses prior methods on Reason-Edit in both understanding and reasoning scenarios, supported by qualitative analyses and a user study indicating stronger alignment with instructions and higher perceived quality. The approach advances practical complex instruction-based image editing and highlights the importance of bidirectional multimodal interaction and carefully curated data for advanced editing tasks.

Abstract

Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios because they depend on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit, a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. However, direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this, we propose a Bidirectional Interaction Module that enables comprehensive bidirectional information interactions between the input image and the MLLM output. During training, we first incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently, we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

Paper Structure (23 sections, 4 equations, 17 figures, 4 tables)


Figures (17)

  • Figure 1: We propose SmartEdit, an instruction-based image editing model that leverages Multimodal Large Language Models (MLLMs) to enhance the understanding and reasoning capabilities of instruction-based editing methods. With its specialized design, SmartEdit can handle complex understanding scenarios (instructions that refer to object attributes such as location, relative size, color, and whether the object is inside or outside a mirror) as well as reasoning scenarios.
  • Figure 2: For more complex instructions or scenarios, InstructPix2Pix fails to follow the instructions.
  • Figure 3: The overall framework of SmartEdit. We first append $r$ $[\mathrm{IMG}]$ tokens to the end of the instruction $c$. Together with the image $x$, they are fed into LLaVA, which produces the hidden states corresponding to these $r$ $[\mathrm{IMG}]$ tokens. These hidden states are then passed through the QFormer to obtain the feature $f$. Next, the image feature $v$ output by the image encoder $E_{\phi}$ interacts with $f$ through a bidirectional interaction module (BIM), yielding $f'$ and $v'$. Finally, $f'$ and $v'$ are fed into the diffusion model to perform instruction-based image editing.
  • Figure 4: The network design of the BIM. In this module, the inputs $f$ and $v$ exchange information in both directions through separate cross-attention operations.
  • Figure 5: Qualitative comparison on Reason-Edit. When compared to several existing instruction-based image editing methods that have undergone further fine-tuning on our synthetic editing dataset, our approach demonstrates superior editing capabilities in complex scenarios.
  • ...and 12 more figures
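
The captions for Figures 3 and 4 describe $f$ and $v$ exchanging information through cross-attention in both directions, producing $f'$ and $v'$. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Several details are assumptions made only for illustration: single-head attention, residual updates, random projection weights, the order of the two updates ($f$ first, then $v$ attending to the updated $f'$), and the helper names `cross_attention`, `init_params`, and `bidirectional_interaction`, none of which appear in the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over `context`."""
    q = queries @ w_q      # (num_queries, d)
    k = context @ w_k      # (num_context, d)
    val = context @ w_v    # (num_context, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # rows sum to 1
    return weights @ val


def init_params(d, rng):
    """Hypothetical helper: random projections for each attention direction."""
    def make():
        return tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    return {"f_from_v": make(), "v_from_f": make()}


def bidirectional_interaction(f, v, params):
    """Return (f', v') after one round of two-way cross-attention.

    f: (r, d) features for the r [IMG] tokens (QFormer output in the paper).
    v: (n, d) image features (image encoder output in the paper).
    Assumption: residual connections and the f-then-v update order.
    """
    f_new = f + cross_attention(f, v, *params["f_from_v"])      # f queries v
    v_new = v + cross_attention(v, f_new, *params["v_from_f"])  # v queries f'
    return f_new, v_new


rng = np.random.default_rng(0)
r, n, d = 4, 16, 8  # illustrative sizes, not taken from the paper
f = rng.normal(size=(r, d))
v = rng.normal(size=(n, d))
f_new, v_new = bidirectional_interaction(f, v, init_params(d, rng))
print(f_new.shape, v_new.shape)  # (4, 8) (16, 8)
```

Both outputs keep the shape of their inputs, so under these assumptions $f'$ and $v'$ can stand in wherever $f$ and $v$ would otherwise be passed to the diffusion model.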