InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Zhiqiang Sheng; Xumeng Han; Zhiwei Zhang; Zenghui Xiong; Yifan Ding; Aoxiang Ping; Xiang Li; Tong Guo; Yao Mao

TL;DR

This work introduces InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing, and proposes a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints.

Abstract

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine-grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit-Bench to catalyze research and steer development towards more dynamic, reason-aware, and intelligent multimodal generative models.

Paper Structure (26 sections, 31 figures, 4 tables)

Introduction
Related Work
Instruction-Based Image Editing
Image Editing Benchmarks
InEdit-Bench
Benchmark Construction
Evaluation Metrics
Standard Visual Quality Metrics
Proposed Process-Oriented Metrics
Experiments
Experiments Setup
Result Analysis
Validity of LMM Scores
Conclusion
Overview of Supplementary Material
...and 11 more sections

Figures (31)

Figure 1: Comparison of previous image editing benchmarks and our proposed InEdit-Bench. (a) Previous Direct Editing Benchmark: Focuses on evaluating the model's ability to execute explicit instructions. (b) Previous Static Reasoning Editing Benchmark: Evaluates the model's ability to perform static reasoning in editing tasks with external knowledge. (c) Our Dynamic Reasoning Editing Benchmark (InEdit-Bench): A benchmark that requires dynamic knowledge reasoning and multi-step planning in editing tasks, aiming to assess the model's comprehensive ability to perform complex, non-direct image editing based on deep semantic understanding.
Figure 2: Overall introduction to InEdit-Bench. InEdit-Bench focuses on dynamic reasoning and multi-step editing modes, requiring models to generate intermediate logical pathways for given tasks. It spans 4 key domains: state transition, dynamic process, temporal sequence, and scientific simulation. The evaluation is conducted through 6 dimensions: appearance consistency, perceptual quality, semantic consistency, logical coherence, scientific plausibility, and process plausibility.
Figure 3: The task type distribution of InEdit-Bench. InEdit-Bench conducts a comprehensive evaluation of visual editing models across 16 sub-tasks under 4 domains.
Figure 4: The evaluation metrics of Logical Coherence, Scientific Plausibility, and Process Plausibility in InEdit-Bench. For each evaluation dimension, the evaluator model (GPT-4o-2024-11-20 in this study) analyzes various inputs based on carefully designed prompts and assigns corresponding scores for each sub-dimension.
Figure 5: Comparison of models across four fundamental tasks. For the dynamic process and scientific simulation tasks, scores in gray denote performance calculated without the scientific plausibility metric.
...and 26 more figures
