Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

Yujia Yang; Yuanxiang Wang; Zhenyu Guan; Tiankun Yang; Chenxi Bao; Haopeng Jin; Jinwen Luo; Xinyu Zuo; Lisheng Duan; Haijin Liang; Jin Ma; Xinming Wang; Ruiwen Tao; Hongzhu Yi

Abstract

While Instruction-based Image Editing (IIE) has made significant progress, existing benchmarks pursue task breadth through mixed evaluations. This paradigm obscures a failure mode that is critical in professional applications: inconsistent model performance across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features a dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via a rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We evaluate 8 mainstream IIE models on Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models degrade significantly when moving from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides diagnostic tools and insights for developing more reliable and stable next-generation IIE models.
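
As a rough illustration of the single-turn diagnostic idea, the sketch below computes a per-model gap over shared-context task pairs. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the record layout (`PairedScore`), the function name `consistency_gap`, the mapping of attribute modification to the low semantic scale and entity replacement to the high semantic scale, and the example scores are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedScore:
    """Scores for one shared-context task pair (hypothetical layout)."""
    image_id: str
    low_scale_score: float   # assumed: the attribute-modification edit
    high_scale_score: float  # assumed: the entity-replacement edit


def consistency_gap(pairs):
    """Mean of (low-scale score - high-scale score) over all pairs.

    A positive value means the model scores lower on the high-scale
    edit than on the low-scale edit of the same source image.
    """
    if not pairs:
        raise ValueError("need at least one task pair")
    return mean(p.low_scale_score - p.high_scale_score for p in pairs)


# Illustrative numbers only; they are not results from the paper.
pairs = [
    PairedScore("img_001", 0.75, 0.5),
    PairedScore("img_002", 1.0, 0.5),
]
gap = consistency_gap(pairs)
```

Pairing the two edits on the same source image keeps the context fixed, so a gap of this kind reflects the change in semantic scale rather than differences between images.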

Paper Structure (38 sections, 6 equations, 9 figures, 11 tables)

Introduction
Related Work
Single-turn Evaluation Benchmarks
Multi-turn Evaluation Benchmarks
Omni IIE Bench Dataset
Data Acquisition and Sourcing
Data Generation and Annotation Pipeline
Annotation Types
Multi-Turn Diagnosis
Generation and Annotation Process
Stage 1: Automated Candidate Data Generation
Stage 2: Automated Mask Generation
Stage 3: Manual Curation and Filtering
Data Statistics
Curation Pipeline and Source Distribution
...and 23 more sections

Figures (9)

Figure 1: Omni IIE Bench collects seed images from 12 datasets and uses GPT-4o to generate image descriptions, which GPT-4o then randomly modifies. For single-turn dialogue generation, modifications are made at both the low and high semantic scales; for multi-turn dialogue generation, modifications are randomly interleaved between the two scales. The original descriptions, modified descriptions, and original images are then input into Nano Banana for image generation, and the results are processed with GroundingDINO and SAM to obtain masks. Finally, all generated images and masks undergo strict manual review.
Figure 2: Overview of manual curation and filtering
Figure 3: Source Distribution of Datasets in Omni IIE Bench
Figure 4: Visualizing Single-Turn and Multi-Turn Consistency Across Multiple Models
Figure 5: Comparison of 8 IIE models across 8 metrics on Omni IIE Bench. The charts show normalized scores for (a) the Single-turn Consistency task and (b) the Multi-turn Coordination task.
...and 4 more figures
