EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing

Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou

TL;DR

EdiVal-Agent introduces an automated, object-centric framework to evaluate multi-turn instruction-based image editing with three specialized metrics (EdiVal-IF, EdiVal-CC, EdiVal-VQ) and a dedicated benchmark (EdiVal-Bench). By decomposing scenes into semantic objects, generating diverse, context-aware edits, and combining symbolic, semantic, and human-preference assessments, it achieves more faithful alignment with human judgments than zero-shot VLM baselines. The framework reveals distinct strengths and failure modes across editing models, highlights exposure bias in open-source editors, and demonstrates the value of multi-turn evaluation for guiding model development. Overall, EdiVal-Agent provides a scalable, interpretable standard for diagnosing and advancing multi-turn, instruction-based visual editing.

Abstract

Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images, which limits coverage and inherits biases from prior generative models, or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
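
The idea of an evolving object pool feeding a consistency score can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ObjectPool` class, the four operation names (`add`, `remove`, `replace`, `modify`), and the use of plain cosine similarity over precomputed feature vectors are all assumptions made for illustration. The abstract does not specify the 9 instruction types or the feature extractor, so the feature dictionaries below stand in for whatever embeddings a real pipeline would compute on detector-guided crops.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class ObjectPool:
    """Tracks which objects exist and which no edit has touched so far."""
    present: set = field(default_factory=set)
    unchanged: set = field(default_factory=set)

    @classmethod
    def from_objects(cls, names):
        return cls(present=set(names), unchanged=set(names))

    def apply(self, op, target, new=None):
        # Hypothetical operation names; the paper's instruction types may differ.
        if op == "remove":
            self.present.discard(target)
            self.unchanged.discard(target)
        elif op == "add":
            self.present.add(target)  # a newly added object is never "unchanged"
        elif op == "replace":
            self.present.discard(target)
            self.unchanged.discard(target)
            self.present.add(new)
        elif op == "modify":  # e.g. a color change: the object stays but is edited
            self.unchanged.discard(target)
        else:
            raise ValueError(f"unknown op: {op}")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistency(pool, feats_before, feats_after):
    """Mean cosine similarity over objects that no edit has touched so far."""
    names = sorted(pool.unchanged)
    if not names:
        return None  # nothing left to compare
    sims = [cosine(feats_before[n], feats_after[n]) for n in names]
    return sum(sims) / len(sims)


pool = ObjectPool.from_objects(["metal yellow sign", "metal brown pole"])
pool.apply("modify", "metal brown pole")  # turn 1: recolor the pole
pool.apply("add", "red bicycle")          # turn 2: add a new object
```

After these two turns only `metal yellow sign` remains in `pool.unchanged`, so a consistency score under this sketch would compare that object alone. Excluding edited and newly added objects is the design choice that keeps the score from penalizing a model for changes the instructions asked for.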

Paper Structure

This paper contains 54 sections, 3 equations, 17 figures, 19 tables, and 11 algorithms.

Figures (17)

  • Figure 1: Overview of our workflow and representative models' performance. For visualization, we adopt two thresholds: a consistency score of at least $90$ and a visual quality score of at least $6$. Details of the automated evaluation pipeline are provided in Figure \ref{fig:firstfigure} and Section \ref{sec:edival_agent}. In multi-turn editing, models exhibit distinct weaknesses: GPT-Image-1 struggles with content consistency, Qwen-Image-Edit underperforms in both visual quality and content consistency, and FLUX.1-Kontext-dev lags in instruction following, whereas Nano Banana shows no single dominant weakness.
  • Figure 2: Framework of EdiVal-Agent. It first decomposes images into semantically meaningful objects, such as metal yellow sign and metal brown pole, and identifies their contextual relationships, e.g., both are in the foreground. Based on this initial analysis, it then generates diverse and appropriate editing scenarios at scale, e.g., Change the color of metal brown pole to gray. Finally, it systematically evaluates editing model outputs along multiple axes with our proposed metrics: EdiVal-IF, EdiVal-CC, and EdiVal-VQ. Our agentic pipeline is agnostic to the expert tools used and can be readily enhanced with more advanced tools in the future.
  • Figure 3: Beautification vs. preservation under the prompt: “Change the background to a library.” GPT-Image-1 tends to increase HPSv3 via beautification, while FLUX.1-Kontext-max emphasizes fidelity to the input.
  • Figure 4: Results of human agreement. Dashed lines represent the average accuracy of each method. EdiVal-IF achieves 81.3% human agreement accuracy, significantly outperforming the VLM (Qwen2.5-VL) at 75.2% and thresholded CLIP_dir at 65.4%. Note that the CLIP_dir threshold is tuned separately for each task.
  • Figure 5: Marginal Task Success rate across turns.
  • ...and 12 more figures
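
The human agreement numbers in the Figure 4 caption, and its note that the CLIP_dir threshold is tuned separately for each task, can be made concrete with a small Python sketch. It assumes that agreement accuracy is the fraction of samples on which an automatic verdict matches the human verdict, and that tuning means choosing the cutoff that maximizes this fraction on one task's data. Both the function names and the tuning procedure are hypothetical; the caption does not describe how the threshold was actually selected.

```python
def agreement(predictions, human_labels):
    """Fraction of samples where the automatic verdict matches the human one."""
    if not predictions or len(predictions) != len(human_labels):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(predictions)


def tune_threshold(scores, human_labels):
    """Pick the cutoff on a continuous score that maximizes human agreement.

    Assumed procedure: try every observed score as a cutoff and keep the
    first one that gives the highest agreement.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]  # "success" if the score reaches t
        acc = agreement(preds, human_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data for a single task: continuous scores and human success labels.
scores = [0.1, 0.4, 0.6, 0.9]
labels = [False, False, True, True]
threshold, accuracy = tune_threshold(scores, labels)
```

Running `tune_threshold` once per task would give each task its own cutoff. A threshold selected and scored on the same samples gives an optimistic agreement figure, which is worth keeping in mind when comparing a tuned baseline against an untuned method.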