ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing

Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, Dahua Lin

TL;DR

This work tackles instruction-based image editing when instructions are implicit or underspecified, proposing ReasonPix2Pix to inject active reasoning into editing models. It introduces a reasoning-attentive dataset with implicit instructions, realistic images from fine-grained categories, and larger variance between input and edited images, and pairs it with a simple framework that combines a multimodal LLM (MLLM) with a diffusion model to jointly understand the image and the instruction. Empirical results show competitive performance on direct editing and significant improvements on edits that require reasoning, supported by ablations on the Part I–III data and on MLLM integration. The approach improves alignment with human intent and makes AI-assisted image editing more practical for nuanced, context-aware applications.

Abstract

Instruction-based image editing focuses on equipping a generative model with the capacity to adhere to human-written instructions for editing images. Current approaches typically comprehend explicit and specific instructions. However, they often exhibit a deficiency in executing active reasoning capacities required to comprehend instructions that are implicit or insufficiently defined. To enhance active reasoning capabilities and impart intelligence to the editing model, we introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset. The dataset is characterized by 1) reasoning instruction, 2) more realistic images from fine-grained categories, and 3) increased variances between input and edited images. When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, independent of whether the tasks require reasoning or not. The code will be available at https://github.com/Jin-Ying/ReasonPix2Pix.

Paper Structure (27 sections, 9 equations, 15 figures, 4 tables)

This paper contains 27 sections, 9 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Generated results from the model trained on our dataset. Given an implicit instruction, our model can understand the instruction and then produce an appropriate edited image.
  • Figure 2: The previous method, InstructPix2Pix, can handle the instruction "add a pair of sunglasses", but it generates a completely wrong result for the instruction "she prefers face mask to sunglasses".
  • Figure 3: For the instruction "make it 50 years later", previous methods can turn a young woman into an old one, but cannot generate any result for a fruit input (an apple). In addition, when the input is a statue of a man, previous methods still make it look old, which is wrong. Reasonable results would be an old woman, a rotten fruit, and a broken statue, respectively. These methods therefore lack the ability to comprehend the image together with the instruction.
  • Figure 4: One sample from InstructPix2Pix dataset. For each paired image, it contains 1) the input image and input caption, 2) the edited image and edited caption, and 3) instructions.
  • Figure 5: Reasoning Instruction Generation. GPT-3.5 is used to generate several candidate instructions from the input caption, edited caption, and instruction in the original dataset. GPT-3.5 then selects the best instruction from these candidates.
  • ...and 10 more figures
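
The sample layout in the Figure 4 caption and the two-step generate-then-select procedure in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `EditSample` field names, the `llm` callable (any function mapping a prompt string to a reply string, standing in for GPT-3.5), the prompt wording, and the fallback to the first candidate are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditSample:
    """One paired sample; fields follow the Figure 4 caption (names are assumed)."""
    input_image_path: str
    input_caption: str
    edited_image_path: str
    edited_caption: str
    instruction: str


# Hypothetical interface: any function that maps a prompt to a text reply.
LLM = Callable[[str], str]


def generate_reasoning_instruction(
    sample: EditSample, llm: LLM, n_candidates: int = 3
) -> str:
    """Generate several candidate instructions, then let the LLM pick one."""
    context = (
        f"Input caption: {sample.input_caption}\n"
        f"Edited caption: {sample.edited_caption}\n"
        f"Original instruction: {sample.instruction}\n"
    )

    # Step 1: ask for several implicit rephrasings of the same edit.
    candidates: List[str] = [
        llm(context + f"Write implicit instruction variant #{i + 1} "
                      "that implies the same edit.").strip()
        for i in range(n_candidates)
    ]

    # Step 2: ask the same model to choose the best candidate by number.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    reply = llm(context + "Candidates:\n" + numbered
                + "\nReply with the number of the best candidate.")

    # Parse the reply defensively; fall back to the first candidate
    # (an assumption of this sketch) if the number is missing or out of range.
    digits = "".join(ch for ch in reply if ch.isdigit())
    index = int(digits) - 1 if digits else 0
    if not 0 <= index < len(candidates):
        index = 0
    return candidates[index]
```

Passing the language model in as a plain callable keeps the sketch independent of any particular API client, so it can be exercised with a stub function before wiring in a real model.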