I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
Juntong Wang, Jiarui Wang, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai, Xiongkuo Min
TL;DR
I2I-Bench introduces a comprehensive, automated benchmark for image-to-image editing that covers both single-image and multi-image tasks, with 10 task categories and 30 fine-grained evaluation dimensions. The framework uses a hybrid Specialist-Generalist evaluation pipeline that combines dedicated tools (OCR, segmentation, feature metrics) with large multimodal models to assess semantic alignment, fidelity, and physical plausibility. Its scores are validated by correlating them with large-scale human preference data. The paper demonstrates strong alignment with human judgments and reveals key trade-offs and limitations shared across current editing models, especially in complex reasoning and cross-image consistency. All components are to be open-sourced to accelerate future research. The benchmark aims to drive progress toward more capable, reliable, and interpretable image editing systems across diverse tasks and modalities.
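To make the "hybrid Specialist-Generalist" idea concrete, here is a minimal sketch, assuming each evaluation dimension is bound to exactly one evaluator (either a dedicated tool or an LMM judge) and that each dimension's score is kept separate rather than merged into a single number. The `Dimension` class, the route labels, the dimension names `text_accuracy` and `physical_plausibility`, and the stub evaluators are all hypothetical illustrations, not I2I-Bench's actual code or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical evaluator signature: (source_path, edited_path, instruction) -> score in [0, 1].
Evaluator = Callable[[str, str, str], float]


@dataclass(frozen=True)
class Dimension:
    name: str
    route: str  # "specialist" (e.g. OCR, segmentation, feature metric) or "generalist" (LMM judge)
    evaluator: Evaluator


def evaluate_sample(dims: List[Dimension], source: str, edited: str,
                    instruction: str) -> Dict[str, float]:
    """Score one edit on every dimension, keeping each score separate (decoupled)."""
    scores: Dict[str, float] = {}
    for dim in dims:
        if dim.route not in ("specialist", "generalist"):
            raise ValueError(f"unknown route {dim.route!r} for {dim.name}")
        score = dim.evaluator(source, edited, instruction)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{dim.name} returned out-of-range score {score}")
        scores[dim.name] = score
    return scores


# Placeholder stubs standing in for a real OCR tool and a real LMM prompt (hypothetical).
def stub_text_accuracy(src: str, out: str, instr: str) -> float:
    return 0.9  # a real specialist would compare OCR output of `out` with the requested text


def stub_lmm_plausibility(src: str, out: str, instr: str) -> float:
    return 0.7  # a real generalist would parse a rating returned by an LMM judge


dims = [
    Dimension("text_accuracy", "specialist", stub_text_accuracy),
    Dimension("physical_plausibility", "generalist", stub_lmm_plausibility),
]
print(evaluate_sample(dims, "src.png", "edited.png", "Change the sign to read OPEN"))
# {'text_accuracy': 0.9, 'physical_plausibility': 0.7}
```

Keeping one score per dimension, instead of collapsing them early, is what lets a benchmark of this kind report per-dimension trade-offs between models.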
Abstract
Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotation, which constrain their scalability and practical applicability. To address this, we propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing; (ii) comprehensive evaluation dimensions, comprising 30 decoupled, fine-grained dimensions scored by automated hybrid methods that integrate specialized tools and large multimodal models (LMMs); and (iii) rigorous alignment validation, verifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we evaluate numerous mainstream image editing models and analyze the gaps and trade-offs between them across dimensions. We will open-source all components of I2I-Bench to facilitate future research.
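The abstract states that benchmark scores are validated against human preferences but does not name the statistic. Rank correlation such as Spearman's rho is a common choice for this kind of check, so the sketch below assumes it. The helper names are illustrative, and the score lists are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Advance j to the last index of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # convert 0-based positions i..j to a 1-based mean
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores on one dimension (placeholders, not paper results).
benchmark_scores = [0.62, 0.71, 0.55, 0.80]  # automated benchmark score per model
human_scores = [0.58, 0.69, 0.60, 0.77]      # mean human preference per model
print(round(spearman(benchmark_scores, human_scores), 3))  # 0.8
```

A rho near 1 means the benchmark orders models the way human raters do. `scipy.stats.spearmanr` computes the same statistic; the hand-written version only keeps the sketch dependency-free.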
