GEditBench v2: A Human-Aligned Benchmark for General Image Editing

Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen

Abstract

Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between the edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench from expert-annotated preference pairs to assess how well PVC-Judge aligns with human judgments of visual consistency. Experiments show that PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, revealing critical limitations of current models and providing a reliable foundation for advancing precise image editing.

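To make the pairwise paradigm concrete, the following is a minimal sketch of how a pairwise visual-consistency judge can be queried and how its agreement with human annotations can be measured. The query_vlm callable, the prompt wording, and the data layout are illustrative assumptions for exposition, not the released PVC-Judge interface.

    # Hypothetical helper signature: query_vlm(images, prompt) -> str.
    # Any chat-style VLM client with multi-image input could fill this role.

    PAIRWISE_PROMPT = (
        "You are given a source image followed by two edited versions, A and B. "
        "Judge which edit better preserves identity, structure, and semantic "
        "coherence with the source, ignoring instruction following and visual "
        "quality. Answer with exactly 'A' or 'B'."
    )

    def pairwise_consistency_vote(query_vlm, source, edit_a, edit_b):
        """Return the judge's preferred candidate, 'A' or 'B'."""
        answer = query_vlm(images=[source, edit_a, edit_b], prompt=PAIRWISE_PROMPT)
        return "A" if answer.strip().upper().startswith("A") else "B"

    def human_agreement(query_vlm, annotated_pairs):
        """Fraction of expert-annotated tuples (source, edit_a, edit_b,
        human_choice) on which the judge picks the human-preferred edit,
        i.e. the agreement rate that Figure 2 reports per paradigm."""
        hits = sum(
            pairwise_consistency_vote(query_vlm, s, a, b) == choice
            for s, a, b, choice in annotated_pairs
        )
        return hits / len(annotated_pairs)

Pointwise scoring would instead ask the model for an absolute score per image; pairwise voting sidesteps calibrating such scores across images, which is one plausible reason for the higher human agreement reported in Figure 2.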

Paper Structure

This paper contains 26 sections, 1 equation, 24 figures, and 7 tables.

Figures (24)

  • Figure 1: GEditBench v2 spans 23 diverse image editing tasks, ranging from predefined edits to complex open-set real-world instructions, offering a comprehensive testbed for evaluating instruction-based image editing models. The central rose diagram visualizes the per-task query count distribution.
  • Figure 2: Human preference agreement of pointwise and pairwise evaluation paradigms across four VLMs in (a) instruction following, (b) visual quality, and (c) visual consistency dimensions. Pairwise evaluation consistently achieves higher agreement with human judgments, suggesting better human alignment than prior absolute scoring. NC and SC prompts are adopted from ye2025unicedit and luo2025editscore, respectively.
  • Figure 3: Two-stage candidate curation pipeline with prompt filtering.
  • Figure 4: Average visual-consistency evaluation accuracy as the number of image-instruction pairs per task grows, shown on six representative tasks. Performance improves steadily and saturates at around 1,500 pairs.
  • Figure 5: Preference data construction pipelines (a minimal sketch of the object-centric scoring idea follows this list). (A) Object-centric pipeline: instructions are parsed to localize the edited entities, partitioning the image into edit and non-edit regions. Region-specific metrics are then applied to enforce background fidelity while evaluating identity consistency within the edited areas. (B) Human-centric pipeline: extends the object-centric pipeline by decomposing human attributes (face identity, body appearance, hair appearance) and dynamically excluding the edited attribute, enabling fine-grained consistency evaluation with specialized expert models.
  • ...and 19 more figures
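To ground Figure 5(A), here is a minimal sketch of region-decoupled consistency scoring used to synthesize one preference pair. The edit-region mask, the masked pixel error for background fidelity, the embedding cosine for identity, and their combination into a single score are all illustrative assumptions; the paper's actual region-specific metrics and expert models may differ.

    import numpy as np

    def background_fidelity(src, edit, edit_mask):
        """Mean squared pixel error restricted to the non-edit region
        (lower is better: the instruction should leave it untouched).
        src/edit: float arrays of shape (H, W, 3); edit_mask: bool (H, W)."""
        keep = ~edit_mask
        return float(np.mean((src[keep] - edit[keep]) ** 2))

    def identity_consistency(src, edit, edit_mask, embed):
        """Cosine similarity of expert-encoder features computed on the
        masked edit region (higher is better). `embed` is a stand-in for
        an identity/appearance expert mapping an image to a 1-D vector."""
        a = embed(src * edit_mask[..., None])
        b = embed(edit * edit_mask[..., None])
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def synthesize_preference(src, cand1, cand2, edit_mask, embed):
        """Label the candidate with the better region-decoupled score as
        'chosen' and the other as 'rejected' -- one synthetic preference
        pair for training a pairwise judge."""
        def score(edit):
            return identity_consistency(src, edit, edit_mask, embed) \
                - background_fidelity(src, edit, edit_mask)
        return (cand1, cand2) if score(cand1) >= score(cand2) else (cand2, cand1)

The key design point is the decoupling itself: fidelity is enforced only where the instruction should change nothing, while identity is checked only inside the edited region, so a candidate cannot trade one property against the other.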