$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie

TL;DR

Complex-Edit introduces a complexity-controllable image editing benchmark and a three-stage Chain-of-Edit data-collection pipeline that uses GPT-4o to generate atomic edits, simplify them, and compound them into complex instructions. It pairs this with a VLM-based auto-evaluation framework that decomposes evaluation into Instruction Following, Identity Preservation, and Perceptual Quality, and it analyzes scoring strategies, rubrics, and meta-evaluation to align the automatic scores with human judgments. Experiments across open-source and proprietary models, on both real and synthetic imagery, show that performance drops as instruction complexity grows, that direct editing outperforms step-by-step sequential editing, that Best-of-N selection improves both, and that a "curse of synthetic data" emerges: models trained with synthetic data produce increasingly synthetic-looking edits as complexity rises, an effect also seen in GPT-4o outputs. The benchmark offers a scalable, interpretable platform for assessing test-time scaling in image editing systems and a resource for developing models that handle complex, instruction-guided edits.

Abstract

We introduce $\texttt{Complex-Edit}$, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
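As a rough illustration of the Best-of-N strategy mentioned in insight 4, the sketch below samples N candidate edits and keeps the highest-scoring one. It is a minimal sketch, not the authors' implementation: `edit_fn` and `score_fn` are hypothetical placeholders for an editing model and a scorer, and how the paper's metrics are combined into a single selection score is not specified in this abstract.

```python
from typing import Any, Callable, Tuple


def best_of_n(
    edit_fn: Callable[[Any, str, int], Any],     # hypothetical: (image, instruction, seed) -> edited image
    score_fn: Callable[[Any, Any, str], float],  # hypothetical: (source, edited, instruction) -> score
    image: Any,
    instruction: str,
    n: int = 4,
) -> Tuple[float, Any]:
    """Generate n candidate edits with different seeds and return (best_score, best_edit)."""
    # Sample n candidates from the (assumed stochastic) editing model.
    candidates = [edit_fn(image, instruction, seed) for seed in range(n)]
    # Score each candidate against the source image and the instruction.
    scored = [(score_fn(image, cand, instruction), cand) for cand in candidates]
    # Keep the highest-scoring candidate; ties resolve to the earliest seed.
    return max(scored, key=lambda pair: pair[0])
```

In practice `score_fn` would wrap whatever automatic evaluator is available; the same wrapper could be applied to a single direct edit or to each step of a sequential edit.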

Paper Structure

This paper contains 46 sections, 2 equations, 22 figures, 5 tables.

Figures (22)

  • Figure 1: An illustration of our Complex-Edit Benchmark. This figure presents a structured progression of instruction complexity in image editing tasks, highlighting the transition from atomic edits to highly intricate transformations.
  • Figure 2: An overview of our data collection pipeline. The pipeline consists of three distinct stages: 1) Stage #1 Sequence Generation: for each image, a series of atomic instructions is produced; 2) Stage #2 Simplification: each fundamental instruction is refined to eliminate extraneous details, preserving only the essential description of the editing process; 3) Stage #3 Instruction Compounding: several atomic instructions are integrated into one comprehensive instruction.
  • Figure 3:
  • Figure 4: An illustration of 24 types of atomic editing operations in 9 categories.
  • Figure 5: Examples of evaluation results for Instruction Following and Identity Preservation.
  • ...and 17 more figures
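
The three stages named in the Figure 2 caption can be sketched as a small function. This is an illustrative sketch under stated assumptions: in the paper each stage is performed with GPT-4o, whereas here `generate`, `simplify`, and `compound` are hypothetical callables supplied by the caller, and the rule that complexity level k compounds the first k atomic instructions is an assumption made for this example.

```python
from typing import Callable, Dict, List


def chain_of_edit(
    image_id: str,
    generate: Callable[[str], List[str]],   # hypothetical Stage 1: image -> atomic instructions
    simplify: Callable[[str], str],         # hypothetical Stage 2: strip extraneous detail
    compound: Callable[[List[str]], str],   # hypothetical Stage 3: merge several into one
    max_level: int = 3,
) -> Dict[int, str]:
    """Return one instruction per complexity level, keyed by level number."""
    # Stages 1 and 2: produce atomic instructions, then simplify each one.
    atomic = [simplify(step) for step in generate(image_id)]
    levels: Dict[int, str] = {}
    for k in range(1, min(max_level, len(atomic)) + 1):
        # Assumption: level k combines the first k atomic instructions.
        levels[k] = atomic[0] if k == 1 else compound(atomic[:k])
    return levels
```

Keeping the stages as separate callables makes it easy to swap a language-model call for a stub when testing the control flow.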