I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing

Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, Rongrong Ji

TL;DR

I2EBench is a comprehensive benchmark that automatically evaluates the quality of edited images produced by instruction-based image editing (IIE) models across multiple dimensions.

Abstract

Significant progress has been made in the field of Instruction-based Image Editing (IIE). However, evaluating these models poses a significant challenge. A crucial requirement in this field is the establishment of a comprehensive evaluation benchmark for accurately assessing editing results and providing valuable insights for its further development. In response to this need, we propose I2EBench, a comprehensive benchmark designed to automatically evaluate the quality of edited images produced by IIE models from multiple dimensions. I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. It offers three distinctive characteristics: 1) Comprehensive Evaluation Dimensions: I2EBench comprises 16 evaluation dimensions that cover both high-level and low-level aspects, providing a comprehensive assessment of each IIE model. 2) Human Perception Alignment: To ensure the alignment of our benchmark with human perception, we conducted an extensive user study for each evaluation dimension. 3) Valuable Research Insights: By analyzing the advantages and disadvantages of existing IIE models across the 16 dimensions, we offer valuable research insights to guide future development in the field. We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models. The code, dataset, and generated images from all IIE models are available on GitHub: https://github.com/cocoshe/I2EBench.

Paper Structure (13 sections, 1 equation, 7 figures, 2 tables)

This paper contains 13 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of I$^2$EBench, an automated system for evaluating the quality of editing results generated by instruction-based image editing (IIE) models. We collected a dataset of over 2,000 images from public datasets [lin2014microsoft, guo2023sky, MartinFTM01, chen2021all, ancuti2019dense, liu2021synthetic, Liu_2021_WACV, qu2017deshadownet, Nah_2017_CVPR, shen2019human, wei2018deep] and annotated them with corresponding original editing instructions. To diversify the instructions, we used ChatGPT [achiam2023gpt] to generate varied versions. With the collected images and the original and diverse editing instructions, we used existing IIE models to generate edited images. Subsequently, we developed an evaluation methodology to automatically assess how well the edited images adhere to the provided instructions along different dimensions. We also ran a human evaluation to obtain human preferences for the editing results of different IIE models. Finally, we analyzed the correlation between automated evaluation and human evaluation, confirming alignment with human perception.
  • Figure 2: Visualization of the editing results on the proposed 16 evaluation dimensions using different IIE models, including InstructAny2Pix [li2023instructany2pix], HIVE [zhang2023hive], InstructEdit [wang2023instructedit], InstructDiffusion [geng2023instructdiffusion], InstructPix2Pix [brooks2023instructpix2pix], MagicBrush [zhang2024magicbrush], MGIE [fu2023guiding], and HQEdit [hui2024hq]. A detailed version can be found in the supplementary materials.
  • Figure 3: Word cloud visualization (a,b) and image quantity statistics (c) of I$^2$EBench.
  • Figure 4: Comparison of radar charts for I$^2$EBench scores in different dimensions using (a) original instructions and (b) diverse instructions.
  • Figure 5: Alignment between I$^2$EBench rank scores (Y-axis) and human scores (X-axis).
  • ...and 2 more figures
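
The alignment check described in Figures 1 and 5 compares automated scores with human scores per dimension. The text here does not say which correlation measure is used, so the sketch below assumes Spearman rank correlation as one plausible choice. The function names and the example scores are hypothetical illustrations, not values or code from I2EBench.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for four models on one dimension (not from the paper).
benchmark_scores = [0.62, 0.48, 0.71, 0.35]
human_scores = [3.9, 3.1, 4.4, 2.6]
rho = spearman(benchmark_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A value near 1 would indicate that the automated ranking of models matches the human ranking on that dimension; repeating the computation per dimension gives one alignment figure for each of the 16 dimensions.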