UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang

TL;DR

UniREditBench addresses the limitations of existing image-editing benchmarks by providing a unified, reasoning-based evaluation across real-world and game-world tasks. It pairs a multimodal dual-reference evaluation framework with an automated data-synthesis pipeline that produces UniREdit-Data-100K; Bagel fine-tuned on this dataset (UniREdit-Bagel) shows substantial gains in both in-domain and out-of-distribution settings. The work delivers a 2,700-sample benchmark, large-scale CoT annotations, and detailed benchmarking of open- and closed-source models, enabling more reliable assessment of reasoning abilities in image editing. The benchmark specifically targets multi-object interactions and rule-based editing in game-world scenarios, which existing benchmarks largely overlook.

Abstract

Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle to handle diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they rely only on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.
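As a rough illustration of the dual-reference idea described in the abstract, the sketch below shows how a benchmark sample might carry both a textual reference and a ground-truth image reference, and how the inputs handed to a judge model differ between text-only and dual-reference evaluation. This is not the authors' implementation: the class name `EditSample`, the function `build_judge_inputs`, and every field name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditSample:
    """One benchmark item. All field names are hypothetical."""
    source_image: str                      # path to the image to be edited
    instruction: str                       # editing instruction that requires reasoning
    text_reference: str                    # textual description of the expected result
    image_reference: Optional[str] = None  # path to a ground-truth edited image, if any


def build_judge_inputs(sample: EditSample, generated_image: str) -> dict:
    """Collect what a judge model would be shown for one generated edit."""
    images = [sample.source_image, generated_image]
    references = {"text": sample.text_reference}
    if sample.image_reference is not None:
        # Dual-reference: the judge also sees the ground-truth image.
        images.append(sample.image_reference)
        references["image"] = sample.image_reference
    mode = "dual-reference" if "image" in references else "text-only"
    return {
        "mode": mode,
        "instruction": sample.instruction,
        "images": images,
        "references": references,
    }
```

With only a textual reference, the judge has to imagine the correct result from a description; with the ground-truth image attached, it can compare the generated edit against a concrete target, which is the motivation the abstract gives for dual-reference evaluation.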

Paper Structure

This paper contains 26 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: UniREditBench covers both real-world and game-world reasoning scenarios across 8 primary dimensions and 18 sub-dimensions. We provide qualitative editing cases of (a) real-world multi-object interaction, and (b) game-world logical/strategy reasoning.
  • Figure 2: Image editing evaluation comparison. Current text-reference-only evaluation can lead to misjudgments, while our dual-reference evaluation yields more reliable assessments.
  • Figure 3: Multi-scenario data synthesis pipeline. (a) Real-world data synthesis pipeline; (b) Game-world data synthesis pipeline; and (c) Case study of our synthesized data.
  • Figure 4: Qualitative cases of evaluation dimensions in UniREditBench. We present qualitative examples for each dimension across both real-world and game-world scenarios.
  • Figure 5: Statistics visualization. We visualize (a) word clouds and (b) the data distribution of our UniREdit-Data-100K.
  • ...and 8 more figures