I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models

Juntong Wang, Jiarui Wang, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai, Xiongkuo Min

TL;DR

I2I-Bench is a comprehensive, automated benchmark for image-to-image editing that covers both single-image and multi-image tasks, with 10 task categories and 30 fine-grained evaluation dimensions. It uses a hybrid Specialist-Generalist evaluation pipeline that combines dedicated tools (OCR, segmentation, feature metrics) with large multimodal models to assess semantic alignment, fidelity, and physical plausibility, and it validates the automated scores against large-scale human preference annotations. The paper reports strong alignment with human preferences, reveals key trade-offs and limitations common to current editing models, especially in complex reasoning and cross-image consistency, and commits to open-sourcing all components to accelerate future research. The benchmark aims to drive progress toward more capable, reliable, and interpretable image editing systems across diverse tasks and modalities.

Abstract

Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.

Paper Structure

This paper contains 85 sections, 5 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: An overview of the proposed image-to-image editing evaluation benchmark suite, I2I-Bench. The process starts with our large-scale Prompt Suite, which defines the editing tasks. These prompts are fed into the Editing Model to edit images. The prompts also guide the selection of relevant dimensions from our hierarchical Evaluation Dimension Suite. Each dimension, in turn, specifies both the automated Evaluation Method Suite (combining Specialists and Generalists) and the criteria for Human Annotation. Finally, the results from the automated methods and human annotations are compared for Alignment Verification to ensure the reliability of our benchmark.
  • Figure 2: Visualization of the 10 task categories in the I2I-Bench Prompt Suite. The left half shows 5 single-image editing (SE) tasks, from “Object Manipulation” to “World Knowledge & Reasoning”. The right half shows 5 multi-image editing (ME) tasks, illustrating increasing complexity from “Basic Combination” to “Combination + Reasoning”.
  • Figure 3: Capability radar charts for the evaluated models on key dimensions. (a) Foundational Quality & Fidelity (SE models). (b) Task Execution & Advanced Capabilities (SE models). (c) Foundational Quality & Fidelity (ME models). (d) Task Execution & Advanced Capabilities (ME models).
  • Figure 4: Performances of top-performing SE and ME models on common dimensions across task categories. (1) The performance of Qwen-Image-Edit-2509 (SE) as task cognitive complexity increases. (2) The performance of nano-banana (ME) varies across complex combination tasks.
  • Figure 5: Performance comparison between Single-Image Editing (SE) and Multi-Image Editing (ME) tasks for Qwen-Image-Edit-2509 and Omnigen2 on shared dimensions.
  • ...and 10 more figures
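
The Alignment Verification step in the Figure 1 caption compares automated scores with human annotations. The text here does not say which statistic the authors use, so the sketch below assumes a Spearman rank correlation computed per evaluation dimension. The function names and the sample scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five edited images on one evaluation dimension.
automated_scores = [0.91, 0.40, 0.75, 0.62, 0.18]  # assumed metric output
human_ratings = [4.5, 2.0, 4.0, 3.5, 1.0]          # assumed annotator means

rho = spearman(automated_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

A rank-based statistic is used in this sketch because automated metrics and human ratings typically live on different scales. Only the ordering of the edited images needs to agree for the correlation to be high.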